Post-Regularization Confidence Bands for High
Dimensional Nonparametric Models with Local Sparsity

Abstract

We propose a novel high dimensional nonparametric model named ATLAS which
naturally generalizes the sparse additive model. Given a covariate of interest $X_j$, the
ATLAS model assumes the mean function can be locally approximated by a sparse
additive function whose sparsity pattern may vary from the global perspective. We
propose to infer the marginal influence function $f_j^*(z) = \mathbb{E}[f(X_1, \ldots, X_d) \mid X_j = z]$ using
a new kernel-sieve approach that combines the local kernel regression with the B-spline
basis approximation. We prove the rate of convergence for estimating $f_j^*$ under the
supremum norm. We also propose two types of confidence bands for $f_j^*$ and illustrate
their statistical-computational tradeoffs. Thorough numerical results on both synthetic
data and real-world genomic data are provided to demonstrate the efficacy of the theory.


1 Introduction 


Nonparametric regression investigates the relationship between a target variable $Y$ and many input
variables $X = (X_1, \ldots, X_d)^T$ without imposing strong assumptions. Consider a model

$$Y = f(X) + \varepsilon, \tag{1.1}$$

where $X \in \mathbb{R}^d$ is a $d$-dimensional random vector in $\mathcal{X}^d$, $\varepsilon$ is random error satisfying $\mathbb{E}[\varepsilon \mid X] = 0$,
and $Y$ is a target variable. We aim to infer the unknown function $f : \mathcal{X}^d \to \mathbb{R}$. When $d$ is small,
fitting a fully nonparametric model (1.1) is feasible (Wasserman, 2006). However, the interpretation
of these models is challenging. When $d$ is large, consistently fitting $f(\cdot)$ must rely on additional
structural assumptions due to the curse of dimensionality.

One of the most popular structural assumptions is that $f(\cdot)$ takes an additive form (Friedman and
Stuetzle, 1981; Stone, 1985; Hastie and Tibshirani, 1990). More specifically, we assume that

$$Y = \mu + \sum_{j=1}^{d} f_j(X_j) + \varepsilon \quad \text{and} \quad \mathbb{E}[f_j(X_j)] = 0, \tag{1.2}$$

where $\mu$ is a constant and $f_j(\cdot)$ is a smooth univariate function. Under the assumption that only $s$
components are nonzero ($s \ll d$), significant progress has been made to understand additive models
in high dimensions (Sardy and Tseng, 2004; Lin and Zhang, 2006; Ravikumar et al., 2009; Meier,
van de Geer, and Bühlmann, 2009; Huang et al., 2010; Koltchinskii and Yuan, 2010; Kato, 2012;
Petersen et al., 2014; Lou et al., 2014).

This paper proposes a novel model family which only needs a sparse additive representation
locally. Heuristically, we assume

$$f(x_1, \ldots, x_d) \approx \sum_{j \in S} f_j(x_j) \quad \text{in a neighborhood of } (x_1, \ldots, x_d),$$

where both the set of active components $S$ and the approximating functions $\{f_j(\cdot)\}_{j \in [d]}$ can vary
with the change of neighborhood. We term our model AddiTive Local Approximation with
Sparsity (ATLAS). A formal definition of the ATLAS model is given below in Definition 2.2.
Compared to the sparse additive model in (1.2), the ATLAS model allows us to handle complicated
interactions. More specifically, it contains functions of the form $f(x_1, \ldots, x_d) = \sum_{j=1}^{d} f_j(x_1, x_j)$,
where for any fixed $x_1$, at most $s$ of $\{f_j(x_1, \cdot)\}_{j \in [d]}$ are nonzero. Under the ATLAS model, we aim
to estimate the marginal influence of the first input variable $X_1$ (without loss of generality) on the
target $Y$. In particular, we are interested in estimating $f_1^*(z) = \mathbb{E}[f(X_1, \ldots, X_d) \mid X_1 = z]$ in a high
dimensional setting and providing confidence bands for it.

The main results of our paper are threefold. First, we obtain the uniform rate of
convergence for the kernel-sieve estimator under the ATLAS model. This is a novel nonparametric
regression method that combines local regression with the sieve method. If the marginal
influence $f_1^*$ belongs to the second order Hölder class (see Definition 2.1), our kernel-sieve estimator
$\hat{f}_1$ achieves the uniform rate $\|\hat{f}_1 - f_1^*\|_\infty = O_P(n^{-1/3}\sqrt{\log(dn)})$, where $\|\cdot\|_\infty$ is the
supremum norm.

Second, we propose two ways to construct confidence bands for the marginal
influence $f_1^*(\cdot)$: (1) the first method exploits extreme value theory; (2) the second method
uses the Gaussian multiplier bootstrap. The confidence band of the first type is based on the
asymptotic distribution of the suprema of an empirical process. A Gumbel type limiting distribution
can be derived based on the strong Gaussian approximation technique developed in Bickel and
Rosenblatt (1973) and Härdle (1989). This type of confidence band is easy to compute; however, its
coverage converges only logarithmically in the sample size to the nominal value. For the confidence
band constructed by a Gaussian multiplier bootstrap, though less efficient in computation, we
show that its coverage has a polynomial convergence rate to the nominal value and therefore has
better performance in finite samples. To establish the validity of the proposed confidence bands
we develop three new technical ingredients: (1) the analysis of the suprema of a high dimensional
empirical process that arises from the kernel-sieve hybrid regression estimator, (2) a de-biasing method
for the proposed estimator, and (3) the approximation analysis for the Gaussian multiplier bootstrap
procedure. The supremum norm rate for our estimator is derived by applying results on the suprema of
empirical processes (Koltchinskii, 2011; van der Vaart and Wellner, 1996; Bousquet, 2002). The
de-biasing procedure for the kernel-sieve hybrid regression estimator extends the approach used
in $\ell_1$ penalized high dimensional linear regression (Zhang and Zhang, 2013; Javanmard and
Montanari, 2014; van de Geer et al., 2014). Compared to the existing literature, this is the first work
considering the de-biasing procedure for a high dimensional nonparametric model. To prove the
validity of the confidence band constructed by the Gaussian multiplier bootstrap, we generalize
the results in Chernozhukov et al. (2014a) and Chernozhukov et al. (2014b) to the high
dimensional ATLAS model.

Our third contribution is the construction of confidence bands for
sparse additive models under two kinds of identifiability conditions. Since the sparse additive model
$f(x_1, \ldots, x_d) = \sum_{j=1}^{d} f_j(x_j)$ is a subset of the ATLAS model, we can apply the same procedure for
ATLAS to construct the confidence band for $f_1^*(z) = \mathbb{E}[f(X_1, \ldots, X_d) \mid X_1 = z]$. However, in the
sparse additive model literature, it is common to estimate $f_1$ under the identifiability condition that
$\mathbb{E}[f_1(X_1)] = 0$. In order to estimate and construct confidence bands for $f_1$, we modify our procedure
as follows. In the estimation step, we still use a sieve-kernel regression; however, we develop a novel
nonparametric estimator similar to the CLIME estimator proposed in Cai et al. (2011) to correct
the bias. In the inference step, we exploit a modified Gaussian multiplier bootstrap procedure to
provide valid confidence bands with polynomial convergence rate to the nominal coverage.


1.1 Related Literature 

Our work contributes to two different areas. For both areas, we make new methodological and
technical contributions.

First, we contribute to the literature on high dimensional nonparametric estimation, which has
recently seen a lot of activity. Lafferty and Wasserman (2008), Bertin and Lecué (2008), Comminges
and Dalalyan (2012), and Yang and Tokdar (2014) study variable selection in a high dimensional
nonparametric regression setting without imposing structural assumptions on $f(\cdot)$ beyond that it
depends only on a subset of variables. A large number of papers have studied the sparse additive
model in (1.2) (Sardy and Tseng, 2004; Lin and Zhang, 2006; Avalos et al., 2007; Ravikumar et al.,
2009; Meier, van de Geer, and Bühlmann, 2009; Huang et al., 2010; Koltchinskii and Yuan, 2010;
Raskutti et al., 2012; Kato, 2012; Petersen et al., 2014; Rosasco et al., 2013; Lou et al., 2014;
Wahl, 2014). In addition, Xu et al. (2011) study a high dimensional convex nonparametric regression.
Dalalyan et al. (2014) study the compound model, which includes the additive model as a special
case. Our approach differs from the existing literature in that we consider the ATLAS model, in
which the additive model is only used as an approximation to the unknown function $f(\cdot)$ at a fixed
point $z$, and we allow this approximation to change with $z$. Our approach only imposes a local sparsity
structure and thus allows for more flexible modeling. We also develop a novel method for estimation
and inference. Meier, van de Geer, and Bühlmann (2009), Huang et al. (2010), Koltchinskii
and Yuan (2010), Raskutti et al. (2012), and Kato (2012) develop estimation schemes mainly based
on the basis approximation and sparsity-smoothness regularization. Our estimator approximates
the function locally using a loss function combining both basis expansion and kernel method with
a hybrid $\ell_1/\ell_2$-penalty. Our theoretical analysis also provides novel technical tools that were not
available before and are of independent interest.

Second, we contribute to a growing literature on high dimensional inference. Initial work on
high dimensional statistics has focused on estimation and prediction (see, for example, Bühlmann
and van de Geer (2011) for a recent overview) and much less work has been done on quantifying
uncertainty, for example, hypothesis testing and confidence intervals. Recently, the focus has
started to shift towards the latter problems. Initial work on construction of p-values in high
dimensional models relied on correct inclusion of the relevant variables (Wasserman and Roeder,
2009; Meinshausen et al., 2009). Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)
study the stability selection procedure, which provides the family-wise error rate for any selection
procedure. Hypothesis testing and confidence intervals for low dimensional parameters in high
dimensional linear and generalized linear models are studied in Belloni et al. (2013a), Belloni et al.
(2013c), van de Geer et al. (2014), Javanmard and Montanari (2014), Javanmard and Montanari (2013),
and Farrell (2013). These methods construct honest, uniformly valid confidence intervals and
hypothesis tests based on the $\ell_1$ penalized estimator in the first stage. Similar results are obtained
in the context of $\ell_1$ penalized least absolute deviation and quantile regression (Belloni et al., 2015,
2013b). Kozbur (2013) extends the approach developed in Belloni et al. (2013a) to a nonparametric
regression setting, where a pointwise confidence interval is obtained based on the penalized series
estimator. Meinshausen (2013) studies construction of one-sided confidence intervals for groups of
variables under weak assumptions on the design matrix. Lockhart et al. (2014) study significance
of the input variables that enter the model along the lasso path. Lee et al. (2013) and Taylor
et al. (2014) perform post-selection inference conditional on the selected model. Chatterjee and
Lahiri (2013), Liu and Yu (2013), Chernozhukov et al. (2013), and Lopes (2014) study properties
of the bootstrap in high dimensions. Our work differs from the existing literature in that it enables
statisticians to make global inference in a nonparametric high dimensional regression setting for
the first time.


1.2 Organization of the Paper 

This paper is organized as follows. In the next section, we formally describe the ATLAS model
and give several concrete examples of functions in the model family. In Section 3 we introduce
the penalized kernel-sieve hybrid regression estimator as a solution to an optimization program.
Section 4 provides the theoretical results on the statistical rate of convergence of the estimator and
proposes two methods for constructing confidence bands based on the asymptotic distribution of
the suprema of an empirical process and the Gaussian multiplier bootstrap. In Section 5 we discuss
the construction of valid confidence bands for sparse additive models under another identifiability
condition. The numerical experiments for synthetic and real data are collected in Section 6.

1.3 Notation 

Let $[n]$ denote the set $\{1, \ldots, n\}$ and let $\mathbb{1}\{\cdot\}$ denote the indicator function. For a vector $a \in \mathbb{R}^d$, we
let $\mathrm{supp}(a) = \{j \mid a_j \neq 0\}$ be the support set (with an analogous definition for matrices $A \in \mathbb{R}^{n_1 \times n_2}$) and
$\|a\|_q$, for $q \in [1, \infty)$, the $\ell_q$-norm defined as $\|a\|_q = (\sum_{i \in [d]} |a_i|^q)^{1/q}$, with the usual extensions
for $q \in \{0, \infty\}$, that is, $\|a\|_0 = |\mathrm{supp}(a)|$ and $\|a\|_\infty = \max_{i \in [d]} |a_i|$. If the vector $a \in \mathbb{R}^d$ is
decomposed into groups such that $a = (a_{G_1}^T, \ldots, a_{G_g}^T)^T$, where $G_1, \ldots, G_g \subseteq [d]$ are disjoint sets,
we denote $\|a\|_{p,q} = (\sum_{k=1}^{g} \|a_{G_k}\|_p^q)^{1/q}$ and $\|a\|_{p,\infty} = \max_{k \in [g]} \|a_{G_k}\|_p$ for any $p, q \in [1, \infty)$. We also
denote the set $\{1, \ldots, j-1, j+1, \ldots, d\}$ as $\backslash j$ and the vector $a_{\backslash j} = (a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_d)^T$.
For a function $f \in L^2(\mathbb{R})$, we define the $L^2$ norm $\|f\|_2 = [\int f^2(x)\,dx]^{1/2}$ and the supremum
norm $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)|$. For a matrix $A \in \mathbb{R}^{n_1 \times n_2}$, we use the notation $\mathrm{vec}(A)$ to denote the
vector in $\mathbb{R}^{n_1 n_2}$ formed by stacking the columns of $A$. We denote the Frobenius norm of $A$ by
$\|A\|_F = (\sum_{i \in [n_1], j \in [n_2]} A_{ij}^2)^{1/2}$ and denote the operator norm as $\|A\|_2 = \sup_{\|v\|_2 = 1} \|Av\|_2$. For two
sequences of numbers $\{a_n\}_{n=1}^{\infty}$ and $\{\beta_n\}_{n=1}^{\infty}$, we use $a_n = O(\beta_n)$ to denote that $a_n \le C\beta_n$ for some
finite positive constant $C$, and for all $n$ large enough. If $a_n = O(\beta_n)$ and $\beta_n = O(a_n)$, we use the
notation $a_n \asymp \beta_n$. The notation $a_n = o(\beta_n)$ is used to denote that $a_n \beta_n^{-1} \to 0$. Throughout the
paper, we let $c, C$ be two generic absolute constants, whose values may change from line to line.

2 ATLAS: Additive Local Approximation Model with Sparsity


In this section, we first define the additive local approximation model with sparsity (ATLAS). We
then discuss properties of the ATLAS model and its connections to sparse additive models.

First, we introduce the Hölder class $\mathcal{H}(\gamma, L)$ of functions.


Definition 2.1. The $\gamma$-th Hölder class $\mathcal{H}(\gamma, L)$ on $\mathcal{X}$ is the set of $\ell = \lfloor \gamma \rfloor$ times differentiable
functions $f : \mathcal{X} \to \mathbb{R}$, where $\lfloor \gamma \rfloor$ represents the largest integer smaller than $\gamma$. The derivative $f^{(\ell)}$
satisfies

$$|f^{(\ell)}(x) - f^{(\ell)}(y)| \le L|x - y|^{\gamma - \ell} \quad \text{for any } x, y \in \mathcal{X}.$$


Let $X = (X_1, \ldots, X_d)^T$ be a $d$-dimensional random vector in $\mathcal{X}^d$. The sparse additive model
(SpAM) is of the form given in (1.2), with only a small number of additive components nonzero.
Let $S \subseteq [d]$ be of size $s = |S| \ll d$, then the sparse additive model can be written as

$$Y = \mu + \sum_{j \in S} f_j(X_j) + \varepsilon \tag{2.1}$$

with $f_j \in \mathcal{H}(2, L)$. Notice that under the SpAM model, there are no interaction terms between
different covariates. In addition, the set of covariates in $S$ affects the response $Y$ globally. The
ATLAS model relaxes these two structural constraints.


Definition 2.2. A $d$-dimensional function $f(x_1, \ldots, x_d)$ has a local sparse additive approximation
for the $j_0$-th coordinate if for any $z \in \mathcal{X}$, there exist functions $f_{1z}(\cdot), \ldots, f_{dz}(\cdot) \in \mathcal{H}(2, L)$, two
bounded functions $L_{j_0}(\cdot) : \mathcal{X}^d \to \mathbb{R}$, $Q_{j_0}(\cdot) : \mathcal{X} \to \mathbb{R}$ and a constant $\delta_0 > 0$ such that for any
$x_{\backslash j_0} = (x_1, \ldots, x_{j_0 - 1}, x_{j_0 + 1}, \ldots, x_d)^T \in \mathcal{X}^{d-1}$, if $x_{j_0} \in (z - \delta_0, z + \delta_0)$, we have the approximation

$$\Big| f(x_1, \ldots, x_d) - \sum_{j=1}^{d} f_{jz}(x_j) - L_{j_0}(z, x_{\backslash j_0})(x_{j_0} - z) \Big| \le Q_{j_0}(z)(x_{j_0} - z)^2. \tag{2.2}$$

Moreover, we assume that the locally additive approximation functions are sparse such that at
most $s$ of the $f_{jz}(\cdot)$'s are not identically zero. The sparsity pattern at each $z \in \mathcal{X}$ is denoted as
$S_z = \{j \in [d] : f_{jz}(\cdot) \not\equiv 0\}$. We call the function class containing functions satisfying Definition 2.2
the ATLAS model for the $j_0$-th coordinate and denote it as $\mathcal{A}_d(s, j_0)$.

Without loss of generality we consider the ATLAS for the first covariate throughout the rest
of the paper and simply denote $\mathcal{A}_d(s, 1)$ as $\mathcal{A}_d(s)$. We name $X_1$ the longitude variable and the
$d$ local approximation functions $f_{1z}, \ldots, f_{dz}$ for each $z \in \mathcal{X}$ the charts at longitude $z$. Notice that the
sparsity patterns of charts may change with $z \in \mathcal{X}$, allowing for more flexible modeling compared to
SpAM with fixed sparsity pattern. Furthermore, the charts $\{f_{jz}\}_{j=1}^{d}$ can also vary with $z$. Therefore,
ATLAS allows complex nonlinear interaction between $X_1$ and other covariates. A visualization of
a $d$-dimensional function in ATLAS is given in Figure 1. For each $z \in \mathcal{X}$, the charts $\{f_{jz}\}_{j=1}^{d}$
are locally defined on the interval $(z - \delta_0, z + \delta_0)$, and by "gluing" the local charts $\{f_{jz}\}_{j=1}^{d}$ over
different $z \in \mathcal{X}$ we assemble the function in ATLAS. The sparse additive model is
a subset of ATLAS with the fixed charts $\{f_j\}_{j=1}^{d}$, which do not vary with the longitude variable. The
following example shows that ATLAS is much richer than SpAM.

Figure 1: The illustration of ATLAS. As the longitude variable $X_1$ changes as $X_1 \in \{z_1, z_2, z_3, z_4\}$,
the sparsity patterns of the charts are different. By fixing the latitude variable $X_j$ for $j = 2, \ldots, 5$,
the values of charts $f_{jz}$ change with $X_1$. Under the sparsity assumption, $f_{jz}$ is zero for most of the
range of $X_1$.


Example 2.3. Consider a $d$-dimensional function with the structure

$$f(x_1, \ldots, x_d) = f_1(x_1) + \sum_{j=2}^{d} a_j(x_1) f_j(x_j), \tag{2.3}$$

where $a_j(\cdot), f_k(\cdot) \in \mathcal{H}(2, L)$ for all $k \in [d]$ and $j \ge 2$. Moreover, for any fixed $z \in \mathcal{X}$, at most $s$
of $\{a_j(z)\}_{j \ge 2}$ are nonzero. We can show that the function in (2.3) satisfies Definition 2.2. We
define $f_{jz}(x_j) = a_j(z) f_j(x_j)$ for $j = 2, \ldots, d$ and $f_{1z}(x_1) = f_1(x_1)$ as charts. In addition, let
$L_1(z, x_{\backslash 1}) = \sum_{j \ge 2} a_j'(z) f_j(x_j)$.
Therefore for any $x_1 \in (z - \delta_0, z + \delta_0)$ and $x_{\backslash 1} \in \mathcal{X}^{d-1}$, we have

$$\Big| f(x_1, \ldots, x_d) - \sum_{j=1}^{d} f_{jz}(x_j) - L_1(z, x_{\backslash 1})(x_1 - z) \Big| \le s \max_{j \in [d]} \|f_j\|_\infty \|a_j''\|_\infty (x_1 - z)^2 := Q_1(z)(x_1 - z)^2,$$

which satisfies Definition 2.2 if $s$ is finite. The nonparametric function in (2.3) allows nontrivial
interactions between $X_1$ and $X_j$ for $j \ge 2$, which cannot be modeled with SpAM. The sparsity
of the function in (2.3) originates from $a_j(x_1)$ and there is no structural assumption on $f_j(x_j)$. If
the $f_j(x_j)$'s are linear functions for all $j \ge 2$, we can write (2.3) as

$$f(x_1, \ldots, x_d) = f_1(x_1) + \sum_{j=2}^{d} a_j(x_1) x_j, \tag{2.4}$$

which is a high dimensional varying coefficient linear model, where the support of the linear
coefficients may vary with $x_1$. From this perspective, ATLAS is also a nonparametric generalization
of the sparse varying coefficient linear model.
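To make the structure in (2.3) concrete, the following minimal sketch evaluates one function of that form and reports its sparsity pattern $S_z$ at a few longitudes. Everything specific here is an illustrative assumption rather than part of the paper: the dimension $d = 5$, the bump-shaped coefficients `a`, the window `WIDTH`, the components `f_component`, and the choice $f_1(x_1) = x_1^2$.

```python
import math

# Hypothetical setup: d = 5 covariates on [0, 1]. Each coefficient a_j(x_1),
# j = 2..5, is nonzero only in a window of half-width WIDTH around CENTERS[j].
CENTERS = {2: 0.2, 3: 0.4, 4: 0.6, 5: 0.8}
WIDTH = 0.15


def a(j, x1):
    """Coefficient a_j(x_1): a C^1 bump with Lipschitz derivative, zero outside its window."""
    u = (x1 - CENTERS[j]) / WIDTH
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0


def f_component(j, xj):
    """Hypothetical smooth component f_j(x_j)."""
    return j * math.sin(2.0 * math.pi * xj)


def f(x):
    """Evaluate f(x_1..x_5) = f_1(x_1) + sum_{j>=2} a_j(x_1) f_j(x_j), with x[j-1] = x_j."""
    x1 = x[0]
    total = x1 ** 2  # f_1(x_1), an arbitrary smooth choice
    for j in range(2, 6):
        total += a(j, x1) * f_component(j, x[j - 1])
    return total


def active_set(z):
    """Sparsity pattern S_z restricted to j >= 2: indices whose chart a_j(z) f_j is nonzero."""
    return [j for j in range(2, 6) if a(j, z) != 0.0]


if __name__ == "__main__":
    for z in (0.1, 0.3, 0.5, 0.9):
        print(z, active_set(z))
```

With these choices at most two of the coefficients are active at any longitude, while the set of active covariates changes as $z$ moves across $[0, 1]$, which a sparse additive model with a single global support cannot represent.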

Motivated by Example 2.3, we have the following proposition on the structure of the ATLAS model.

Proposition 2.4. For any $d$-dimensional function $f(x_1, \ldots, x_d) \in \mathcal{A}_d(s)$, there exist $d$ bivariate
functions $\{f_j(x_j, x_1)\}_{j=1}^{d}$ such that $f(x_1, \ldots, x_d) = \sum_{j=1}^{d} f_j(x_j, x_1)$. Moreover, for any fixed $z \in \mathcal{X}$,
the charts at $z$ are $f_{jz}(x) = f_j(x, z)$ for $j = 1, \ldots, d$ and at most $s$ of $\{f_j(\cdot, z)\}_{j=1}^{d}$ are nonzero.

Proof. We choose an arbitrary $z \in \mathcal{X}$. According to the definition of $\mathcal{A}_d(s)$, we let $x_1$ go to $z$ on
both sides of (2.2) and derive that $f(z, x_2, \ldots, x_d) = \sum_{j=1}^{d} f_{jz}(x_j)$ with $x_1 = z$. Since $z$ is arbitrary, we
let $f_j(x_j, x_1) = f_{j x_1}(x_j)$ for any $x_1, x_j \in \mathcal{X}$ and $j = 1, \ldots, d$. Due to the properties of the charts
$\{f_{jz}\}_{j=1}^{d}$ in Definition 2.2, the proposition is proved. □

Proposition 2.4 shows that the ATLAS model belongs to the bivariate additive model family
and all these bivariate functions share one common covariate. Therefore, we can also interpret
ATLAS as a varying coefficient sparse additive model, which generalizes the varying coefficient
sparse linear model in (2.4). However, ATLAS is not covered by the compound functional model
proposed in Dalalyan et al. (2014), as the sparsity pattern of ATLAS varies with the shared covariate,
while the sparsity pattern of the compound functional model is fixed.

To make the charts $\{f_{jz}\}_{j=1}^{d}$ identifiable for any $z \in \mathcal{X}$, we impose the following condition. For
any $z \in \mathcal{X}$, the charts at $z$ satisfy

$$\mathbb{E}[f_{jz}(X_j) \mid X_1 = z] = 0 \quad \text{for any } j = 2, \ldots, d. \tag{2.5}$$

We defer the validity proof of the above identifiability condition to Appendix C.1 in the
supplementary material. Under the sparse additive model in (1.2), a common identifiability condition
is $\mathbb{E}[f_j(X_j)] = 0$, which is different from the one in (2.5). However, our goal is to estimate
$f_1^*(z) = \mathbb{E}[f(X_1, \ldots, X_d) \mid X_1 = z]$, which is naturally identified under (2.5) instead of the condition
$\mathbb{E}[f_{jz}(X_j)] = 0$. We also discuss the estimation and inference aspects under the latter condition in
Section 5.

3 Penalized Kernel-Sieve Hybrid Regression 


In this section, we describe a nonparametric estimator combining the local kernel regression with the 
B-spline based sieve method. Our kernel-sieve hybrid regression applies the local kernel regression 
over the longitude variable and B-spline regression on the rest of the covariates. The group lasso 
penalty is used to shrink the coefficients in the expansions and select relevant variables locally. 

Let $\{(X_i, Y_i)\}_{i=1}^{n}$ be $n$ independent random samples of $(X, Y)$ distributed according to (1.1),
where the unknown function $f$ belongs to $\mathcal{A}_d(s)$ in Definition 2.2. Let $\{f_{jz}\}_{j \in [d]}$ be the charts
of $f(x_1, \ldots, x_d)$ at $z \in \mathcal{X}$. Before describing our estimator, we first introduce the centered basis
functions that will be used in the estimation. Let $\{\phi_1, \ldots, \phi_m\}$ be the normalized B-spline basis
functions (Schumaker, 2007). Given $m$ basis functions, we denote $f_{jm;z}(x)$ as the projection of $f_{jz}$
onto the space spanned by the basis, $B_m = \mathrm{Span}(\phi_1, \ldots, \phi_m)$. In particular, we define

$$f_{jm;z}(\cdot) := \mathop{\mathrm{argmin}}_{f \in B_m} \|f - f_{jz}\|_2 = \sum_{k=1}^{m} \beta^*_{jk;z}\, \psi^*_{jk;z}(\cdot), \tag{3.1}$$

where $\psi^*_{jk;z}$ are the locally centered bases defined as

$$\psi^*_{jk;z}(x) = \phi_k(x) - \mathbb{E}[\phi_k(X_j) \mid X_1 = z], \quad \text{for all } j \in [d],\ k \in [m]. \tag{3.2}$$

Notice that the basis functions $\{\psi^*_{jk;z}\}_{j \in [d], k \in [m]}$ satisfy $\mathbb{E}[\psi^*_{jk;z}(X_j) \mid X_1 = z] = 0$ for any $z \in \mathcal{X}$, and
this property ensures that $f_{jm;z}(\cdot)$ also satisfies the identifiability condition (2.5). To compute $\psi^*_{jk;z}$,
we need to estimate the unknown $\mathbb{E}[\phi_k(X_j) \mid X_1 = z]$. Let the kernel function $K : \mathcal{X} \to \mathbb{R}$ be a
symmetric density function with bounded support and denote $K_h(\cdot) = h^{-1} K(\cdot / h)$, where $h > 0$ is
the bandwidth. We estimate (3.2) by

$$\psi_{jk;z}(x) = \phi_k(x) - \bar{\phi}_{jk}(z) \quad \text{with} \quad \bar{\phi}_{jk}(z) = \frac{\sum_{i=1}^{n} \phi_k(X_{ij}) K_h(X_{i1} - z)}{\sum_{i=1}^{n} K_h(X_{i1} - z)}. \tag{3.3}$$

Here $\bar{\phi}_{jk}(z)$ is a Nadaraya-Watson kernel estimator of $\mathbb{E}[\phi_k(X_j) \mid X_1 = z]$.
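The centering step in (3.3) can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Epanechnikov kernel as one example of a symmetric density with bounded support, and linear B-splines (hat functions on equally spaced knots in $[0, 1]$) as a stand-in for the normalized B-spline basis. The function names and the 0-indexed column convention (column 0 holds $X_1$) are hypothetical.

```python
def epanechnikov(u):
    """Symmetric density with bounded support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_h(u, h):
    """Rescaled kernel K_h(u) = K(u / h) / h."""
    return epanechnikov(u / h) / h


def hat_basis(x, k, m):
    """k-th (0-indexed) linear B-spline on m equally spaced knots in [0, 1]."""
    step = 1.0 / (m - 1)
    return max(0.0, 1.0 - abs(x - k * step) / step)


def nw_basis_mean(X, j, k, m, z, h):
    """Nadaraya-Watson estimate of E[phi_k(X_j) | X_1 = z]; X[i][0] is X_{i1}, X[i][j] the covariate."""
    weights = [kernel_h(row[0] - z, h) for row in X]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within bandwidth h of z")
    return sum(w * hat_basis(row[j], k, m) for w, row in zip(weights, X)) / total


def centered_basis(X, j, k, m, z, h):
    """Return the locally centered basis function x -> phi_k(x) - phi_bar_jk(z)."""
    phi_bar = nw_basis_mean(X, j, k, m, z, h)

    def psi(x):
        return hat_basis(x, k, m) - phi_bar

    return psi
```

By construction, the kernel-weighted sample average of `psi` over the observations is zero, which is the empirical counterpart of the identifiability condition (2.5).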

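Now, we are ready to introduce the penalized kernel-sieve hybrid regression estimator. The
kernel-sieve hybrid loss function at a fixed point $z \in \mathcal{X}$ is given as

$$\mathcal{L}_z(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} K_h(X_{i1} - z) \Big( Y_i - \bar{Y} - \alpha - \sum_{j=2}^{d} \sum_{k=1}^{m} \psi_{jk;z}(X_{ij}) \beta_{jk} \Big)^2, \tag{3.4}$$

where $\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$ and the basis functions $\psi_{jk;z}$ are defined in (3.3). Let $\beta = (\beta_2^T, \ldots, \beta_d^T)^T \in \mathbb{R}^{(d-1)m}$,
where $\beta_j = (\beta_{j1}, \ldots, \beta_{jm})^T \in \mathbb{R}^m$ are the coefficients of the B-spline basis functions. The
penalized kernel-sieve hybrid estimator at $z \in \mathcal{X}$ is defined as

$$(\hat{\alpha}_z, \hat{\beta}_z) = \mathop{\mathrm{argmin}}_{\alpha, \beta}\ \mathcal{L}_z(\alpha, \beta) + \lambda \mathcal{R}(\alpha, \beta), \tag{3.5}$$

where the penalty function is

$$\mathcal{R}(\alpha, \beta) = \sqrt{m} \cdot |\alpha| + \sum_{j \ge 2} \|\beta_j\|_2, \tag{3.6}$$

with $\lambda$ being a tuning parameter. We estimate the charts $\{f_{jz}\}_{j \in [d]}$ by $\hat{f}_{1z}(z) = \hat{\alpha}_z$ and
$\hat{f}_{jz}(x) = \sum_{k=1}^{m} \psi_{jk;z}(x) \hat{\beta}_{jk;z}$ for $j \ge 2$. Here $\hat{\alpha}_z$ is also an estimator of the marginal influence function $f_1^*(z)$.
Based on $\hat{\alpha}_z$ and $\hat{\beta}_z$, we can estimate the $d$-dimensional function $f(z, x_2, \ldots, x_d)$ by

$$\hat{f}(z, x_2, \ldots, x_d) = \hat{\alpha}_z + \sum_{j=2}^{d} \sum_{k=1}^{m} \psi_{jk;z}(x_j) \hat{\beta}_{jk;z}, \tag{3.7}$$

where $\hat{\beta}_{jk;z}$ is the coordinate of $\hat{\beta}_z$ corresponding to the $k$-th B-spline basis of the $j$-th covariate.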
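As a reading aid, the loss (3.4) and penalty (3.6) can be evaluated directly once the centered basis values and kernel weights are available. The sketch below assumes they are precomputed and passed in as nested lists; the names `hybrid_loss`, `penalty`, `psi`, and `weights` are hypothetical and only illustrate the formulas.

```python
import math


def hybrid_loss(alpha, beta, psi, y, weights):
    """Kernel-sieve hybrid loss (3.4).

    psi[i][j][k] : centered basis value psi_{jk;z}(X_ij) for the (j+2)-th covariate
    beta[j][k]   : matching B-spline coefficient
    weights[i]   : kernel weight K_h(X_i1 - z)
    """
    n = len(y)
    y_bar = sum(y) / n
    total = 0.0
    for i in range(n):
        fit = alpha
        for psi_ij, beta_j in zip(psi[i], beta):
            fit += sum(p * b for p, b in zip(psi_ij, beta_j))
        total += weights[i] * (y[i] - y_bar - fit) ** 2
    return total / n


def penalty(alpha, beta):
    """Penalty (3.6): sqrt(m) * |alpha| plus the sum of Euclidean norms of the blocks beta_j."""
    m = len(beta[0])
    group_norms = sum(math.sqrt(sum(b * b for b in beta_j)) for beta_j in beta)
    return math.sqrt(m) * abs(alpha) + group_norms
```

The penalized objective in (3.5) is then `hybrid_loss(alpha, beta, psi, y, weights) + lam * penalty(alpha, beta)` for a tuning parameter `lam`.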
Remark 3.1. The estimators $\hat{\alpha}_z$ and $\hat{\beta}_z$ are estimating different quantities. Notice that $\hat{\alpha}_z$ estimates
the value of $f_1^*$ at a point, while $\hat{\beta}_z$ estimates the coefficients of B-splines. According to Corollary
15 in Chapter XI of de Boor (2001), given a function $g(x) = \sum_{k=1}^{m} \beta_k \phi_k(x)$, we have

$$\|g\|_2^2 \asymp m^{-1} \sum_{k=1}^{m} \beta_k^2. \tag{3.8}$$

From this we see that the scales of $\hat{\alpha}_z$ and $\hat{\beta}_z$ are different, which explains the additional $\sqrt{m}$ term
multiplying $|\alpha|$ in the penalty function (3.6).


As discussed in the introduction, there are a number of methods proposed for estimating
nonparametric functions. For example, we could estimate $f_1^*$ by the local linear regression estimator
(Fan, 1992) defined as

$$(\hat{f}_1^{\mathrm{loc}}(z), \hat{b}_z) = \mathop{\mathrm{argmin}}_{a, b} \sum_{i=1}^{n} K_h(X_{i1} - z)\,[Y_i - a - b(X_{i1} - z)]^2.$$

Under the assumption that the conditional variance $\sigma^2(z) = \mathrm{Var}(Y \mid X_1 = z)$ is bounded and
continuous, Fan (1992) shows that there exists a constant $C(z)$ only related to the marginal influence
$f_1^*(z)$ and $X_1$'s marginal density function $p_1(z)$ such that the mean square error satisfies

$$\mathbb{E}[(\hat{f}_1^{\mathrm{loc}}(z) - f_1^*(z))^2 \mid X_{11}, \ldots, X_{n1}] \le C(z)\, \sigma^2(z)\, n^{-4/5},$$

where $\sigma^2(z) = \mathrm{Var}[f(X_1, \ldots, X_d) + \varepsilon \mid X_1 = z]$. However, under the ATLAS model, the conditional
variance $\sigma^2(z)$ can be potentially unbounded, preventing us from obtaining a uniform rate for the
local linear estimator. Furthermore, by considering the effects of other covariates, we can reduce
the variance of the estimator even in the case when the conditional variance is bounded.

Another widely used nonparametric method is the sieve estimator. For the sparse additive model in
(1.2), Huang et al. (2010) consider minimizing

$$\hat{\beta}^{\mathrm{SpAM}} = \mathop{\mathrm{argmin}}_{\beta}\ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \bar{Y} - \sum_{j=1}^{d} \sum_{k=1}^{m} \psi_{jk}(X_{ij}) \beta_{jk} \Big)^2 + \lambda \sum_{j=1}^{d} \|\beta_j\|_2, \tag{3.9}$$

where $\psi_{jk}(x) = \phi_k(x) - n^{-1} \sum_{i=1}^{n} \phi_k(X_{ij})$, to estimate $f_j$ by $\hat{f}_j^{\mathrm{SpAM}}(x) = \sum_{k=1}^{m} \psi_{jk}(x) \hat{\beta}_{jk}^{\mathrm{SpAM}}$. Their
estimator uses the B-spline approximation for all additive functions under SpAM in (2.1) and cannot
be applied to ATLAS. The same comment applies to other procedures for estimating SpAM using
the global approximation method, including Ravikumar et al. (2009), Meier, van de Geer, and
Bühlmann (2009), Huang et al. (2010), Koltchinskii and Yuan (2010), and Kato (2012).
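For comparison with the kernel-sieve approach, the local linear estimator of Fan (1992) discussed above has a closed form as a two-parameter weighted least squares fit. The sketch below is illustrative only: the Epanechnikov kernel and the name `local_linear` are assumptions, and the inputs are the longitude observations `x1` and the responses `y`.

```python
def epanechnikov(u):
    """Symmetric density with bounded support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear(x1, y, z, h):
    """Minimise sum_i K_h(x1_i - z) * (y_i - a - b * (x1_i - z))^2 over (a, b).

    Returns (a_hat, b_hat); a_hat estimates the regression function at z and
    b_hat its slope. Solves the 2x2 weighted normal equations directly.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x1, y):
        u = xi - z
        w = epanechnikov(u / h) / h
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * yi
        t1 += w * u * yi
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few distinct observations within bandwidth h of z")
    a_hat = (s2 * t0 - s1 * t1) / det
    b_hat = (s0 * t1 - s1 * t0) / det
    return a_hat, b_hat
```

The fit uses only the longitude variable, so the contribution of the remaining covariates is absorbed into the residual; this is the variance term that the kernel-sieve estimator aims to reduce.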


3.1 Computational Algorithm 


In this section, we describe an algorithm to minimize (3.5). We start by introducing some extra
notation. Denote $\Psi = (\Psi_{1\cdot}, \ldots, \Psi_{n\cdot})^T \in \mathbb{R}^{n \times (1 + (d-1)m)}$, where $\Psi_{ij} = (\psi_{j1}(X_{ij}), \ldots, \psi_{jm}(X_{ij}))^T \in \mathbb{R}^m$
and $\Psi_{i\cdot} = (1, \Psi_{i2}^T, \ldots, \Psi_{id}^T)^T \in \mathbb{R}^{1 + (d-1)m}$ for $i \in [n]$ and $j \ge 2$. We also write $\Psi = (\Psi_{\cdot 1}, \ldots, \Psi_{\cdot d})$,
where $\Psi_{\cdot 1} = (1, \ldots, 1)^T \in \mathbb{R}^n$ and $\Psi_{\cdot j} = (\Psi_{1j}, \ldots, \Psi_{nj})^T \in \mathbb{R}^{n \times m}$ for $j \ge 2$. We further denote
$Y = (Y_1 - \bar{Y}, \ldots, Y_n - \bar{Y})^T \in \mathbb{R}^n$, $\beta_+ = (\alpha, \beta^T)^T \in \mathbb{R}^{1 + (d-1)m}$, $\beta^*_+ = (f_1^*(z), \beta^{*T})^T \in \mathbb{R}^{1 + (d-1)m}$ and
$W_z = \mathrm{diag}(K_h(X_{11} - z), \ldots, K_h(X_{n1} - z)) \in \mathbb{R}^{n \times n}$.
To unify the notation in our algorithm, we also write $\beta_+ = (\beta_1, \beta_2^T, \ldots, \beta_d^T)^T$, where $\beta_1 = \alpha$ and
$\beta = (\beta_2^T, \ldots, \beta_d^T)^T$. We also define the tuning parameters $\lambda_j = \lambda\sqrt{m}$ for $j = 1$ and $\lambda_j = \lambda$ for
$j \ge 2$. Using the above notation, we can rewrite the objective function in (3.5) as

$$\mathcal{L}_z(\beta_+) + \lambda \mathcal{R}(\beta_+) = \frac{1}{n} (Y - \Psi\beta_+)^T W_z (Y - \Psi\beta_+) + \lambda \mathcal{R}(\beta_+). \tag{3.10}$$

Finally, denote the gradient $\nabla_j \mathcal{L}_z(\beta_+) := \partial \mathcal{L}_z(\beta_+) / \partial \beta_j$. For different values of $z$, we denote the
components of $\beta_+$ corresponding to the $k$-th B-spline basis for the $j$-th covariate as $\beta_{jk;z}$.

We minimize the objective function in (3.10) using the randomized coordinate descent for
composite functions (RCDC) proposed in Richtárik and Takáč (2014). Details of the procedure
are given in Algorithm 1. Suppose the result of the $t$-th iteration is $\beta_+^{(t)}$. In the next iteration, we
randomly choose one coordinate $j_{t+1}$ from $\{1, \ldots, d\}$ and update the corresponding block of $\beta_+^{(t)}$. Each update in
(3.11) can be obtained in a closed form as

$$T(\beta_j^{(t)}) = \mathcal{T}_{\lambda_j / \rho_j}\big( \beta_j^{(t)} - \rho_j^{-1} \nabla_j \mathcal{L}_z(\beta_+^{(t)}) \big) - \beta_j^{(t)}, \tag{3.12}$$

where $\rho_j$ is a certain regularization constant and $\mathcal{T}_\lambda$ is the soft-thresholding operator, which is defined as
$\mathcal{T}_\lambda(v) = (v / \|v\|_2) \cdot \max\{0, \|v\|_2 - \lambda\}$.

Algorithm 1 Randomized coordinate descent for group Lasso
for t = 1, 2, ... do
    Let $\beta_+^{(t)} = (\beta_1^{(t)}, \beta_2^{(t)T}, \ldots, \beta_d^{(t)T})^T$.
    Choose $j_t = j \in [d]$ with probability $1/d$.
    Compute $T(\beta_j^{(t)})$ for the $j$-th block as
        $T(\beta_j^{(t)}) = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^{\dim(\beta_j)}} \big\{ \tfrac{\rho_j}{2} \|\theta\|_2^2 + \langle \nabla_j \mathcal{L}_z(\beta_+^{(t)}), \theta \rangle + \lambda_j \|\theta + \beta_j^{(t)}\|_2 \big\}$. (3.11)
    Update $\beta_j^{(t+1)} = \beta_j^{(t)} + T(\beta_j^{(t)})$.
end for

If we evaluate the estimator $\hat{\alpha}_z$ for $M$ different $z$'s, a naive
approach is to run Algorithm 1 $M$ times. The computational complexity is $O(dm^2 nM)$. However,
we propose a method to accelerate Algorithm 1 by exploiting the special structure of kernel functions.
The accelerated method improves the computational complexity to $O(dm^2(n + M))$. Therefore,
the computational complexity of our method is comparable to applying RCDC to minimize the
objective function in (3.9) for SpAM estimation. More details can be found in Section A in the
supplementary material.
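The block update in Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' accelerated implementation: the design is passed as a list of blocks (block 0 being the intercept column), the response `y` is assumed to be already centered, and each $\rho_j$ is set to a trace upper bound on the block Lipschitz constant, which is one valid but hypothetical choice. All function names are illustrative.

```python
import math
import random


def soft_threshold(v, lam):
    """Group soft-thresholding: (v / ||v||_2) * max(0, ||v||_2 - lam)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = (norm - lam) / norm
    return [scale * x for x in v]


def objective(blocks, y, w, beta, lams):
    """(1/n) * sum_i w_i * (y_i - fit_i)^2 + sum_j lams[j] * ||beta_j||_2."""
    n = len(y)
    loss = 0.0
    for i in range(n):
        fit = sum(sum(p * b for p, b in zip(blocks[j][i], beta[j]))
                  for j in range(len(blocks)))
        loss += w[i] * (y[i] - fit) ** 2
    pen = sum(lam * math.sqrt(sum(b * b for b in bj)) for lam, bj in zip(lams, beta))
    return loss / n + pen


def rcd_group_lasso(blocks, y, w, lams, n_iter=2000, seed=0):
    """Randomized block coordinate descent for the weighted group-lasso objective.

    blocks[j][i] : row i of the j-th design block (block 0 is the intercept)
    y            : centered responses; w: kernel weights; lams[j]: lambda_j
    """
    rng = random.Random(seed)
    n, d = len(y), len(blocks)
    beta = [[0.0] * len(blocks[j][0]) for j in range(d)]
    resid = list(y)  # residual y - Psi * beta, kept up to date
    # rho_j: trace upper bound on the largest eigenvalue of (2/n) Psi_j^T W Psi_j
    rho = [(2.0 / n) * sum(w[i] * sum(p * p for p in blocks[j][i]) for i in range(n))
           for j in range(d)]
    for _ in range(n_iter):
        j = rng.randrange(d)  # uniform choice of a block
        if rho[j] == 0.0:
            continue
        dim = len(beta[j])
        grad = [-(2.0 / n) * sum(w[i] * blocks[j][i][k] * resid[i] for i in range(n))
                for k in range(dim)]
        target = [beta[j][k] - grad[k] / rho[j] for k in range(dim)]
        new = soft_threshold(target, lams[j] / rho[j])  # closed-form block update
        for i in range(n):
            resid[i] -= sum(blocks[j][i][k] * (new[k] - beta[j][k]) for k in range(dim))
        beta[j] = new
    return beta
```

Because each $\rho_j$ upper-bounds the curvature of the loss in block $j$, every update is a proximal gradient step that cannot increase the objective. Running this routine once per evaluation point corresponds to the naive approach described above; the acceleration from the supplementary material is not reproduced here.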


4 Theoretical Properties 


This section establishes the rate of convergence for the proposed estimator in Section 4.1. The
confidence band for $f_1^*$ is constructed from the asymptotic distribution of a de-biased estimator in
Section 4.2. Another confidence band based on the Gaussian multiplier bootstrap is discussed in Section 4.3.


4.1 Estimation Consistency 

We start by stating the required assumptions. Let $p(x_1, \ldots, x_d)$ denote the joint density of
$X = (X_1, \ldots, X_d)$ and let $p_j(x_j)$ denote the marginal density of $X_j$, for $j \in [d]$. We also denote
$p_{abc}(x_a, x_b, x_c)$ to be the joint density of $(X_a, X_b, X_c)$ for $a, b, c \in [d]$.

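(A1) (Density function) The density function $p(x_1, \ldots, x_d)$ is continuous on $\mathcal{X}^d$ and its support $\mathcal{X}$
is compact. For each $j \in [d]$, the marginal density $p_j \in \mathcal{H}(2, L)$. There exist fixed constants
$0 < b < B < \infty$ such that $b \le p_{1bc}(x_1, x_b, x_c) \le B$ for all $b, c \in \{2, \ldots, d\}$.

(A2) (Kernel function) The kernel $K(u)$ is a bounded and continuous density function satisfying

$$\int_{\mathcal{X}} K(u)\,du = 1 \quad \text{and} \quad \int_{\mathcal{X}} u K(u)\,du = 0.$$

(A3) (Design Matrix) Let $\Sigma_z = \mathbb{E}[K_h(X_1 - z) \Psi_{1\cdot} \Psi_{1\cdot}^T]$, recalling that $\Psi_{1\cdot} = (1, \Psi_{12}^T, \ldots, \Psi_{1d}^T)^T$.
For any $J \subseteq [d]$, we define a cone

$$\mathcal{C}^{(\kappa)}(J) = \Big\{ \beta_+ = (\alpha, \beta^T)^T \ \Big|\ \sum_{j \notin J} \|\beta_j\|_2 \le \kappa \sum_{j \in J,\, j \neq 1} \|\beta_j\|_2 + \kappa \sqrt{m} |\alpha| \Big\}. \tag{4.1}$$

There exists a universal constant $\rho_{\min}$ independent of $n, d, z$ such that the restricted minimum
eigenvalue on $\mathcal{C}^{(\kappa)}(J)$ satisfies

$$\inf_{z \in \mathcal{X}}\ \inf_{|J| \le s}\ \inf_{\beta_+ \in \mathcal{C}^{(\kappa)}(J)} \frac{\beta_+^T \Sigma_z \beta_+}{\|\beta\|_2^2 + m\alpha^2} \ge \frac{\rho_{\min}}{m}. \tag{4.2}$$

(A4) (Noise Term) The error term $\varepsilon$ satisfies $\mathbb{E}[\varepsilon \mid X] = 0$ and is a subgaussian random variable
such that $\mathbb{E}[\exp(\lambda \varepsilon)] \le \exp(\lambda^2 \sigma^2 / 2)$ for any $\lambda$.

(A5) (ATLAS) The nonparametric function $f(x_1, \ldots, x_d) \in \mathcal{A}_d(s)$ defined in Definition 2.2.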

Assumption (A1) on the density function $p(\cdot)$ of covariates is stronger than the one used in
Huang et al. (2010). They only assume that the univariate densities $\{p_j(x_j)\}_{j \in [d]}$ are bounded away
from infinity and zero. However, Assumption (A1) is reasonable for kernel-type methods. For
example, Opsomer and Ruppert (1997) and Fan and Jiang (2005) study the additive model with
two covariates, $Y = \mu + f_1(X_1) + f_2(X_2) + \varepsilon$, and impose

$$\sup_{x_1, x_2 \in \mathcal{X}} \Big| \frac{p(x_1, x_2)}{p_1(x_1) p_2(x_2)} - 1 \Big| < 1,$$

which implies that $p(x_1, x_2)$ is bounded away from infinity and zero. Under the ATLAS model, it is
reasonable to extend the boundedness assumption to the joint density of three covariates since it
models more complicated interactions. Assumption (A2) is standard in the literature on local linear
regression (Fan, 1993) and Assumption (A4) is standard for the noise variable in SpAM (Meier,
van de Geer, and Bühlmann, 2009; Huang et al., 2010; Koltchinskii and Yuan, 2010; Raskutti
et al., 2012; Kato, 2012).


Assumption (A3) is similar to the restricted strong convexity condition in Negahban et al. (2012). Note that $\Sigma_z$ is the expectation of the Hessian matrix of the loss function $\mathcal{L}_z(\beta_+)$. We require $\Sigma_z$ to be positive definite when restricted to vectors in the cone $\mathbb{C}_\beta^{(\kappa)}(J)$. Again, the additional factor $\sqrt{m}$ in front of $|\alpha|$ makes sure that $\alpha$ and $\beta_z$ are calibrated on the same scale (see Remark 3.1). Assumption (A3) can be derived from the assumption on the design in Koltchinskii and Yuan (2010). They consider the quantity

$$\beta_{2,\kappa}(J) = \inf\Big\{\beta > 0 \;\Big|\; \sum_{j\in J}\|h_j\|_2^2 \le \beta^2\Big\|\sum_{j=1}^d h_j\Big\|_2^2,\; (h_1,\ldots,h_d)\in\mathbb{C}_h^{(\kappa)}(J)\Big\}, \qquad (4.3)$$

where $\mathbb{C}_h^{(\kappa)}(J) = \{(h_1,\ldots,h_d) \mid \sum_{j\notin J}\|h_j\|_2 \le \kappa\sum_{j\in J}\|h_j\|_2\}$ for $J \subset [d]$.

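The following proposition describes the connection between $\beta_{2,\kappa}(J)$ and Assumption (A3).


Proposition 4.1. We define a uniform quantity based on the constant in (4.3) as

$$\beta_{2,\kappa} = \sup_{|J|\le s}\inf\Big\{\beta > 0 \;\Big|\; \sum_{j\in J}\|h_j\|_2^2 \le \beta^2\Big\|\sum_{j=1}^d h_j\Big\|_2^2,\; (h_1,\ldots,h_d)\in\mathbb{C}_h^{(\kappa)}(J)\Big\}.$$

Under Assumption (A1), there exist constants $c, C > 0$ such that

$$\inf_{z\in\mathcal{X}}\;\inf_{|J|\le s}\;\inf_{\beta_+\in\mathbb{C}_\beta^{(\kappa)}(J)} \frac{\beta_+^T\Sigma_z\beta_+}{\|\beta\|_2^2 + m\alpha^2} \ge \frac{Cb}{sB^2(c\kappa+1)^2\beta_{2,c\kappa}^2\,m}. \qquad (4.4)$$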
Proposition 4.1 implies that if the number of active components $s$ is finite, $\beta_{2,c\kappa} < \infty$ is sufficient to guarantee Assumption (A3). The assumption that $s$ is finite is required in the previous works (Meier, van de Geer and Bühlmann, 2009; Huang et al., 2010; Koltchinskii and Yuan, 2010; Kato, 2012). The proof of Proposition 4.1 is stated in Appendix C.2 in the supplementary material.


In the following, we present the rate of convergence of the kernel-sieve hybrid regression estimator. 

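Theorem 4.2. Suppose that Assumptions (A1)-(A5) are satisfied. There exists a constant $C$ such that if $h = o(1)$, $m \to \infty$ as $n \to \infty$, and

$$\lambda \ge C\left(\sqrt{\frac{\log(dmh^{-1})}{nh}} + \frac{\sqrt{s}}{m^{5/2}} + \frac{m^{3/2}\log(dh^{-1})}{n} + \frac{h^2}{\sqrt{m}}\right), \qquad (4.5)$$

the estimator $(\hat\alpha_z, \hat\beta_z^T)^T$ defined in (3.5) satisfies

$$\sup_{z\in\mathcal{X}}\sum_{j=2}^d\|\hat\beta_j - \beta_j\|_2 \le \frac{sm}{\rho_{\min}}\,\lambda \quad\text{and}\quad \sup_{z\in\mathcal{X}}|\hat\alpha_z - f_1^*(z)| \le \frac{8\sqrt{m}}{\rho_{\min}}\,\lambda$$

with probability $1 - c/n$ for some constant $c > 0$. Furthermore, the estimator $\hat f$ in (3.7) satisfies

$$\|\hat f - f\|_2 \le \rho_{\min}^{-1}\,s\sqrt{m}\,\lambda \qquad (4.6)$$

with probability $1 - c/n$.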


The estimation error comes from four sources. The noise $\varepsilon$ contributes $O\big(\sqrt{\log(dmh^{-1})/(nh)}\big)$ in (4.5). The second term in (4.5), $O(\sqrt{s}\,m^{-5/2})$, comes from the approximation error introduced by using $m$ B-spline basis functions to estimate the true chart $\{f_{jz}\}_{j=2}^d$. The third source of error comes from the kernel method, which uses a constant to estimate $f_1^*(z)$ locally. The fourth source of error comes from searching for the correct local approximation by $s$ additive functions, as required by Definition 2.2. Together, the third and fourth sources contribute $O\big(n^{-1}m^{3/2}\log(dh^{-1}) + h^2/\sqrt{m}\big)$ to the estimation error. The detailed proof of Theorem 4.2 is shown in Section 8.1.

The statistical rate in (4.6) is minimized when we choose $h \asymp n^{-1/6}$, $m \asymp n^{1/6}$ and $\lambda \asymp n^{-5/12}\sqrt{\log(dn)}$. With these choices, we obtain $\|\hat f - f\|_2^2 = O_P(n^{-2/3}\log(dn))$. This convergence rate is slower than the optimal rate $O_P(n^{-4/5} + \log d/n)$ for estimating the sparse additive model (Raskutti et al., 2012). This is because the ATLAS model is essentially a two dimensional additive model according to Proposition 2.4. In fact, we achieve the optimal rate up to logarithmic factors for the two dimensional Hölder class (Stone, 1980), which is strictly covered by the ATLAS model.
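As a worked check of the rate calculation, the display below substitutes $h = n^{-1/6}$ and $m = n^{1/6}$ into the four terms of the lower bound on $\lambda$ in Theorem 4.2. It treats $s$ as fixed and assumes $\log(dn) = O(n^{2/3})$ so that the third term is of lower order; both are simplifying assumptions made here for illustration.

```latex
\begin{align*}
\sqrt{\frac{\log(dmh^{-1})}{nh}}
  &= \sqrt{\frac{\log(d\,n^{1/3})}{n^{5/6}}}
   \lesssim n^{-5/12}\sqrt{\log(dn)}, \\
\frac{\sqrt{s}}{m^{5/2}}
  &= \sqrt{s}\,n^{-5/12}, \\
\frac{m^{3/2}\log(dh^{-1})}{n}
  &= n^{1/4-1}\log(d\,n^{1/6})
   \lesssim n^{-3/4}\log(dn), \\
\frac{h^{2}}{\sqrt{m}}
  &= n^{-1/3}\,n^{-1/12} = n^{-5/12}.
\end{align*}
% Hence \lambda \asymp n^{-5/12}\sqrt{\log(dn)} balances the dominant terms, and
\[
  \big(s\sqrt{m}\,\lambda\big)^{2}
  \asymp \big(n^{1/12}\,n^{-5/12}\sqrt{\log(dn)}\big)^{2}
  = n^{-2/3}\log(dn).
\]
```

The first, second and fourth terms share the exponent $-5/12$, which is why this pair $(h, m)$ balances them.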


Moreover, the rate can also be explained by the fact that the ATLAS model contains the sparse additive model as a special case and sparse additive functions are only used locally to approximate functions in the ATLAS model. Since the sparsity pattern varies with the longitude variable, model selection is harder than in the context of sparse additive models. Specifically, the slower rate comes from the error term $T_n = \sup_{z\in\mathcal{X}}\max_{j\in[d]} n^{-1}\|\Psi_{\cdot j}^TW_z\varepsilon\|_2 = O_P\big(\sqrt{\log(dmh^{-1})/(nh)}\big)$. In comparison, Huang et al. (2010) only need to bound $T_n' = \sup_{z\in\mathcal{X}}\max_{j\in[d]} n^{-1}\|\Psi_{\cdot j}^T\varepsilon\|_2 = O_P\big(\sqrt{\log(dn)/n}\big)$ (see their Lemma 2). Note that $T_n = O_P(h^{-1/2}T_n')$ because the kernel matrix $W_z = \mathrm{diag}(K_h(X_{11} - z),\ldots,K_h(X_{n1} - z))$ in $T_n$ increases its variance by $O_P(h^{-1/2})$. Detailed technical analysis of $T_n$ is given in Lemma 8.4.
4.2 Confidence Bands


We construct the confidence band for $f_1^*$ based on a de-biased estimator. A confidence band $C_n$ is a set of confidence intervals $C_n = \{C_n(z) = [c_L(z), c_U(z)] \mid z \in \mathcal{X}\}$. For simplicity, we define the interval $c_0(z) \pm r_0(z) := [c_0(z) - r_0(z), c_0(z) + r_0(z)]$. We use $f \in C_n$ to denote that $f$ lies in the confidence band, i.e., $f(z) \in C_n(z)$ for all $z \in \mathcal{X}$.

Zhang and Zhang (2013), van de Geer et al. (2014), and Javanmard and Montanari (2014) propose de-biased estimators for high-dimensional linear regression. van de Geer et al. (2014) also study generalized linear models. Asymptotic normality for a fixed subset of coordinates of the de-biased estimator is established. We extend their approach to high dimensional nonparametric models and consider a novel correction for $\hat\alpha_z$ that reduces the bias introduced by the local model selection in (3.5). Let $\hat v_z = e_1/\hat p_1(z)$, where $e_1$ is the first canonical basis in $\mathbb{R}^{(d-1)m+1}$ and $\hat p_1(z) = n^{-1}\sum_{i=1}^n K_h(X_{i1} - z)$. We write the de-biased estimator as

$$\hat f_1^u(z) = \hat\alpha_z + \frac{1}{n}\hat v_z^T\Psi^TW_z(Y - \Psi\hat\beta_+). \qquad (4.7)$$
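Because the correction vector in (4.7) is a multiple of the first canonical basis vector and the first entry of each row of the design matrix is the constant 1, the correction term reduces to a kernel-weighted average of the fitted residuals. The sketch below computes the de-biased value in that form. It assumes the usual scaling $K_h(u) = h^{-1}K(u/h)$ and uses the kernel $(15/16)(1-u^2)^2$ from the numerical section; the function names and inputs (`alpha_hat`, `residuals`) are hypothetical stand-ins for the output of the penalized fit in (3.5).

```python
import numpy as np


def biweight(u):
    """Kernel K(u) = (15/16) (1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def debiased_estimate(z, x1, residuals, alpha_hat, h):
    """De-biased value alpha_hat + (1/n) v^T Psi^T W_z (Y - Psi beta_hat).

    residuals[i] stands for Y_i minus the fitted value at observation i.
    With v = e_1 / p1_hat(z), the correction equals
        sum_i K_h(X_i1 - z) * residuals[i] / sum_i K_h(X_i1 - z).
    Assumes at least one observation falls within distance h of z.
    """
    w = biweight((x1 - z) / h) / h      # K_h(X_i1 - z), assumed scaling
    p1_hat = w.mean()                   # kernel density estimate of p_1(z)
    correction = (w * residuals).mean() / p1_hat
    return alpha_hat + correction


# Hypothetical inputs: a constant residual of 0.3 shifts the estimate by 0.3.
x1 = np.linspace(0.0, 1.0, 101)
print(debiased_estimate(0.5, x1, np.full(101, 0.3), alpha_hat=1.0, h=0.2))
```

Read this way, the de-biasing step adds back a Nadaraya-Watson smooth of whatever signal the penalized fit left in the residuals near $z$.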


We will show that the difference between the de-biased estimator $\hat f_1^u$ and the true function $f_1^*$ can be written as

$$\sqrt{nh}\,\big(\hat f_1^u(z) - f_1^*(z)\big) = \sqrt{h/n}\;\hat v_z^T\Psi^TW_z\varepsilon + o_P(1). \qquad (4.8)$$

In order to rigorously show (4.8), we consider the following standard assumption for kernel regression (Johnston, 1982; Härdle, 1989).

(KR) We assume that the error $\varepsilon \mid X = z$ is symmetric for any $z \in \mathcal{X}$. Let $U = f_1^*(X_1) + \varepsilon$ and let $f_U$ denote the density of $U$. For some sequence $\{a_n\}_{n=1}^\infty$ with $a_n \to \infty$ as $n \to \infty$, we have

$$h^{-3}\log n\int_{|u| > a_n} f_U(u)\,du = O(1).$$


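Theorem 4.3. Suppose that Assumptions (A1)-(A5) and (KR) are satisfied. If $h \asymp n^{-\delta}$ for $1/5 < \delta < 1/3$ and

$$d_n = \begin{cases} (2\delta\log n)^{1/2} + (2\delta\log n)^{-1/2}\big[\log\big(c_1(K)/\pi^{1/2}\big) + \tfrac12(\log\delta + \log\log n)\big], & \text{if } c_1(K) = \{K^2(A) + K^2(-A)\}/[2\lambda(K)] > 0; \\ (2\delta\log n)^{1/2} + (2\delta\log n)^{-1/2}\log\big(c_2(K)/\pi^{1/2}\big), & \text{otherwise, with } c_2(K) = \int_{-A}^{A}[K'(u)]^2\,du/[2\lambda(K)], \end{cases}$$

where $\lambda(K) = \int_{-A}^{A}K^2(u)\,du$, we have

$$\lim_{n\to\infty}\mathbb{P}\Big((-2\log h)^{1/2}\Big[\sup_{z\in\mathcal{X}}(nh)^{1/2}\,r(z)\,\big|\hat f_1^u(z) - f_1^*(z)\big| - d_n\Big] < x\Big) = \exp[-2\exp(-x)], \qquad (4.9)$$

where $r(z) = \sigma^{-1}\sqrt{p_1(z)/\lambda(K)}$. Furthermore, we can construct the $100\times(1-\alpha)\%$ simultaneous confidence band $C_{n,\alpha} = \{C_{n,\alpha}(z) \mid z \in \mathcal{X}\}$ such that

$$C_{n,\alpha}(z) := \hat f_1^u(z) \pm (nh)^{-1/2}\big[\hat\sigma^2\lambda(K)/\hat p_1(z)\big]^{1/2}\Big(c_\alpha^*/\sqrt{-2\log h} + d_n\Big), \qquad (4.10)$$

where $c_\alpha^* = -\log(-\log(1-\alpha)/2)$, and $\hat\sigma$ and $\hat p_1(z)$ are consistent estimators of $\sigma$ and $p_1(z)$.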
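The sketch below evaluates the ingredients of the Gumbel-type band in (4.10) for the kernel $K_q(u) = (15/16)(1-u^2)^2$ on $[-1, 1]$ used in the numerical section. For this kernel $K(\pm 1) = 0$, so $c_1(K) = 0$ and the second branch of $d_n$ applies, with $\lambda(K) = 5/7$ and $c_2(K) = (15/7)/(2 \cdot 5/7) = 1.5$. The function names and the plug-in values are illustrative assumptions, and the code follows the formula for $d_n$ as read from Theorem 4.3 above.

```python
import math

LAMBDA_K = 5.0 / 7.0                    # integral of K_q(u)^2 over [-1, 1]
C2_K = (15.0 / 7.0) / (2.0 * LAMBDA_K)  # integral of K_q'(u)^2 / (2 lambda(K)) = 1.5


def gumbel_critical_value(alpha):
    """c*_alpha solving exp(-2 exp(-c)) = 1 - alpha."""
    return -math.log(-math.log(1.0 - alpha) / 2.0)


def d_n(n, delta):
    """Centering sequence d_n for a kernel with c_1(K) = 0 (second branch)."""
    a = math.sqrt(2.0 * delta * math.log(n))
    return a + math.log(C2_K / math.sqrt(math.pi)) / a


def gumbel_half_width(n, h, delta, sigma_hat, p1_hat, alpha=0.05):
    """Half-width of the band (4.10) at a point with density estimate p1_hat.

    Requires 0 < h < 1 so that -2 log h is positive.
    """
    scale = sigma_hat * math.sqrt(LAMBDA_K / p1_hat) / math.sqrt(n * h)
    c = gumbel_critical_value(alpha)
    return scale * (c / math.sqrt(-2.0 * math.log(h)) + d_n(n, delta))


# Hypothetical plug-in values: n = 300, h = n^{-1/4}, sigma_hat = 1.5, p1_hat = 1.
n, delta = 300, 0.25
print(gumbel_half_width(n, n ** (-delta), delta, sigma_hat=1.5, p1_hat=1.0))
```

The critical value enters only through the slowly varying factor $1/\sqrt{-2\log h}$, which is one way to see why the coverage of this band approaches its nominal level slowly.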
4.3 Gaussian Multiplier Bootstrap Confidence Bands


Theorem 4.3, given in the previous section, describes the asymptotic properties of the estimator $\hat f_1^u$. However, it relies on Assumption (KR), which is restrictive and hard to verify. Furthermore, the coverage probability based on (4.9) converges to the nominal coverage at the rate $O(1/\log n)$ (Hall, 1991). Hall (1991) shows that the slow convergence rate is due to the asymptotic approximation based on the Gumbel distribution, while the Gaussian approximation still converges at a polynomial rate, which leads to the development of bootstrap methods that require the existence of the asymptotic distribution of certain studentized processes (Claeskens and Van Keilegom, 2003; Bissantz et al., 2007). Recently, Chernozhukov et al. (2014a) and Chernozhukov et al. (2014b) developed a novel framework to analyze the Gaussian multiplier bootstrap for approximating the distribution of the suprema of an empirical process and applied it to kernel density estimation. We extend their approach and consider the Gaussian multiplier bootstrap to estimate the distribution of the process

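$$Z_n(z) = \frac{\sqrt{nh}\,\big(\hat f_1^u(z) - f_1^*(z)\big)}{\hat\sigma\,[\lambda(K)/\hat p_1(z)]^{1/2}},$$

where $\hat p_1(z) = n^{-1}\sum_{i=1}^n K_h(X_{i1} - z)$ is the kernel density estimator of $p_1(z)$ and $\hat\sigma^2 = n^{-1}\sum_{i=1}^n\big(Y_i - \hat f(X_{i1},\ldots,X_{id})\big)^2$ is the estimator of $\mathrm{Var}(\varepsilon)$. In order to approximate the distribution of $Z_n$, we consider the Gaussian multiplier process

$$G_n(z) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \xi_i\,\frac{\hat\varepsilon_i\,K_h(X_{i1} - z)}{\hat\sigma\sqrt{h^{-1}\hat p_1(z)\lambda(K)}}, \qquad (4.11)$$

where $\xi_1,\ldots,\xi_n$ are independent standard Gaussian random variables, which are also independent of the observed data, and $\hat\varepsilon_i = Y_i - \hat f(X_{i1},\ldots,X_{id})$, $i = 1,\ldots,n$, is the estimator of the noise term $\varepsilon_i$. Let $c_n(\alpha)$ be the $1-\alpha$ quantile of $\sup_{z\in\mathcal{X}}G_n(z)$. We construct the confidence band at level $100\times(1-\alpha)\%$ as $C_{n,\alpha}^b = \{C_{n,\alpha}^b(z) \mid z \in \mathcal{X}\}$ where

$$C_{n,\alpha}^b(z) = \hat f_1^u(z) \pm c_n(\alpha)(nh)^{-1/2}\hat\sigma\,[\lambda(K)/\hat p_1(z)]^{1/2}. \qquad (4.12)$$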

We can show that this confidence band is asymptotically honest with a polynomial convergence rate.

Theorem 4.4. Under Assumptions (A1)-(A5), if $h \asymp n^{-\delta}$ with $1/5 < \delta < 1/3$, there exists a constant $C > 0$ such that we have the coverage of $f_1^*(z) = \mathbb{E}[f(X) \mid X_1 = z]$:

$$\mathbb{P}\big(f_1^* \in C_{n,\alpha}^b\big) \ge 1 - \alpha - Cn^{-1/10}. \qquad (4.13)$$

In particular, the confidence band $C_{n,\alpha}^b$ is asymptotically honest, that is,

$$\liminf_{n\to\infty}\mathbb{P}\big(f_1^* \in C_{n,\alpha}^b\big) \ge 1 - \alpha.$$
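A minimal sketch of how the bootstrap quantile $c_n(\alpha)$ and the band half-width in (4.12) could be computed is given below. Several choices are assumptions made here for illustration rather than details confirmed by the text: the multiplier process is taken as a kernel-weighted sum of residuals normalized to roughly unit variance, the supremum is approximated on a finite grid, the absolute value of the process is used so that the band is two-sided, and $K_h(u) = h^{-1}K(u/h)$ with the kernel $(15/16)(1-u^2)^2$. All function and argument names are hypothetical.

```python
import numpy as np

LAMBDA_K = 5.0 / 7.0  # integral of K(u)^2 for K(u) = (15/16)(1 - u^2)^2


def biweight(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def multiplier_bootstrap_band(x1, resid, z_grid, h, alpha=0.05,
                              n_boot=1000, seed=0):
    """Return (c_n(alpha), half-width on z_grid) for a band of the form (4.12).

    x1: first covariate, shape (n,); resid: fitted residuals, shape (n,).
    Assumes every grid point has observations within distance h.
    """
    rng = np.random.default_rng(seed)
    n = x1.shape[0]
    sigma_hat = np.sqrt(np.mean(resid**2))
    # K_h(X_i1 - z) for each grid point z: shape (len(z_grid), n).
    kh = biweight((x1[None, :] - z_grid[:, None]) / h) / h
    p1_hat = kh.mean(axis=1)
    # Normalizer sigma_hat * sqrt(lambda(K) * p1_hat / h) per grid point.
    denom = sigma_hat * np.sqrt(LAMBDA_K * p1_hat / h)
    weights = kh * resid[None, :] / denom[:, None]
    # One row of i.i.d. N(0, 1) multipliers per bootstrap draw.
    xi = rng.standard_normal((n_boot, n))
    g = xi @ weights.T / np.sqrt(n)            # shape (n_boot, len(z_grid))
    sup_stat = np.abs(g).max(axis=1)           # sup over the grid, two-sided
    c = float(np.quantile(sup_stat, 1.0 - alpha))
    half_width = c * sigma_hat * np.sqrt(LAMBDA_K / p1_hat) / np.sqrt(n * h)
    return c, half_width


# Synthetic residuals for illustration only.
rng = np.random.default_rng(1)
x1 = rng.uniform(size=400)
resid = 1.5 * rng.standard_normal(400)
grid = np.linspace(0.2, 0.8, 25)
c, hw = multiplier_bootstrap_band(x1, resid, grid, h=0.2)
print(c, hw[:3])
```

Only the Gaussian multipliers are redrawn across bootstrap repetitions; the data-dependent weights are computed once, which keeps the procedure cheap relative to refitting the estimator.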
Remark 4.5. The confidence band constructed by the multiplier bootstrap does not require $\varepsilon$ to be symmetric, nor does it impose the tail condition on $f_U$ in Assumption (KR). If we compare the confidence band $C_{n,\alpha}$ derived from the limiting Gumbel distribution and $C_{n,\alpha}^b$ from the Gaussian multiplier bootstrap, the convergence rate to the nominal coverage is better for the Gaussian multiplier bootstrap, which translates into better finite sample performance. We give a proof of Theorem 4.4 in Appendix B.1 in the supplementary material.


5 Confidence Bands for SpAM under the Identifiability Condition $\mathbb{E}[f_j(X_j)] = 0$


In this section, we focus on the SpAM model in (2.1), which is a special case of ATLAS. In the previous sections, under the identification condition in (2.5), we considered estimation and construction of confidence bands for $f_1^*(\cdot) = \mathbb{E}\big[\sum_{j=1}^d f_j(X_j) \mid X_1 = \cdot\big]$. However, another popular identifiability condition for SpAM is to enforce

$$\mathbb{E}[f_j(X_j)] = 0, \quad\text{for all } j = 1,\ldots,d. \qquad (5.1)$$

More details on this identifiability condition can be found in Ravikumar et al. (2009), Meier, van de Geer and Bühlmann (2009), Huang et al. (2010), Koltchinskii and Yuan (2010), and Kato (2012).

The identification condition in (2.5) is not equivalent to (5.1). Suppose that the additive functions $\{f_j(x_j)\}_{j\in[d]}$ are identified under (5.1); then the corresponding charts are

$$f_1^*(x_1) = f_1(x_1) + \sum_{j=2}^d\mathbb{E}[f_j(X_j) \mid X_1 = x_1] \quad\text{and}\quad f_{jz}(x_j) = f_j(x_j) - \mathbb{E}[f_j(X_j) \mid X_1 = z]. \qquad (5.2)$$

Unless $X_j, X_1$ are pairwise independent for all $j \ge 2$, $f_1^*$ is different from $f_1$ for SpAM in (2.1).


We apply a procedure similar to the one described in Section 3 to construct a valid confidence band for $f_1$ under the identifiability condition in (5.1). Since the identifiability condition is changed, we redefine (3.1) and (3.2) by

$$f_{jm}(\cdot) := \mathop{\mathrm{argmin}}_{f\in S_m}\|f - f_j\|_2 = \sum_{k=1}^m\beta_{jk}^*\psi_{jk}^*(\cdot),$$

where $\psi_{jk}^*(x) = \phi_k(x) - \mathbb{E}[\phi_k(X_j)]$. Similarly, we replace $\bar\phi_{jk}(z)$ in (3.3) by $\bar\phi_{jk} = n^{-1}\sum_{i=1}^n\phi_k(X_{ij})$, and the new centered B-spline basis functions are $\psi_{jk}(x) = \phi_k(x) - \bar\phi_{jk}$. For simplicity, we use the same notation $\Psi_{ij}$, $\Psi_{i\cdot}$ and $\Psi$ as in Section 3 by substituting the new basis $\psi_{jk}(\cdot)$. In particular, for $i \in [n]$, $j \ge 2$, we redefine

$$\Psi_{ij} = (\psi_{j1}(X_{ij}),\ldots,\psi_{jm}(X_{ij}))^T, \quad \Psi_{i\cdot} = (1, \Psi_{i2}^T,\ldots,\Psi_{id}^T)^T \quad\text{and}\quad \Psi = (\Psi_{1\cdot},\ldots,\Psi_{n\cdot})^T. \qquad (5.3)$$

Let $(\hat\alpha_z, \hat\beta_z)$ be the kernel-sieve hybrid estimator given in (3.5), where we have plugged the new B-spline basis $\psi_{jk}(\cdot)$ into the loss function in (3.4). We will obtain a result similar to Theorem 4.2 under the identifiability condition (5.1). To achieve that, we need to modify Assumption (A3) since the B-spline basis is changed.
(A3') Let $\Sigma_z = \mathbb{E}[K_h(X_1 - z)\Psi_{1\cdot}\Psi_{1\cdot}^T]$, where $\Psi_{1\cdot}$ is redefined in (5.3). There exists a constant $\rho_{\min}$ independent of $n, d, z$ such that the restricted minimum eigenvalue on $\mathbb{C}_\beta^{(\kappa)}(J)$ satisfies

$$\inf_{z\in\mathcal{X}}\;\inf_{|J|\le s}\;\inf_{\beta_+\in\mathbb{C}_\beta^{(\kappa)}(J)} \frac{\beta_+^T\Sigma_z\beta_+}{\|\beta\|_2^2 + m\alpha^2} \ge \frac{\rho_{\min}}{m}. \qquad (5.4)$$


In Assumption (A3') the matrix $\Sigma_z$ is replaced by its counterpart built from the basis redefined in (5.3). Although we use different procedures to center the basis functions under the identifiability conditions (5.1) and (2.5) ($\psi_{jk}(x) = \phi_k(x) - \bar\phi_{jk}$ and $\psi_{jk}(x) = \phi_k(x) - \bar\phi_{jk}(z)$, respectively), Assumptions (A3) and (A3') are almost identical. In particular, Proposition 4.1 remains true with the redefined matrix replacing $\Sigma_z$. Therefore, if $\beta_{2,c\kappa}$ in (4.4) is finite, both Assumptions (A3) and (A3') hold. Define the sparse additive function class

$$\mathcal{K}_d(s) = \Big\{f = \sum_{j\in S}f_j(x_j) \;\Big|\; |S| \le s,\; f_j \in \mathcal{H}(2, L) \text{ for } j \in S\Big\}. \qquad (5.5)$$

We have the following theorem on the estimation rate.

Theorem 5.1. Consider the data generating process as in (2.1) and $f \in \mathcal{K}_d(s)$ where $s$ is finite. Suppose that Assumptions (A1), (A2), (A4) and (A3') hold. If $h = o(1)$, $m \to \infty$ as $n \to \infty$ and $\lambda$ satisfies (4.5), then there exists a constant $C$ such that with probability $1 - c/n$ for a certain constant $c > 0$,

$$\sup_{z\in\mathcal{X}}\sum_{j=2}^d\|\hat\beta_j - \beta_j\|_2 \le \frac{Csm}{\rho_{\min}}\,\lambda \quad\text{and}\quad \sup_{z\in\mathcal{X}}|\hat\alpha_z - f_1(z)| \le \frac{C\sqrt{m}}{\rho_{\min}}\,\lambda.$$

The proof of this theorem follows the same procedure as the proof of Theorem 4.2 shown in Section 8.1, so we omit the details.



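Let

$$\hat\theta_z = \mathop{\mathrm{argmin}}_\theta\;\theta^T\hat\Sigma_z\theta, \quad\text{subject to } \|\hat\Sigma_z\theta - e_1\|_{2,\infty} \le \gamma, \qquad (5.6)$$

where $\hat\Sigma_z = n^{-1}\Psi^TW_z\Psi$ and $e_1$ is the first canonical basis in $\mathbb{R}^{(d-1)m+1}$. The de-biased estimator is given as

$$\hat f_1^u(z) = \hat\alpha_z + \frac{1}{n}\hat\theta_z^T\Psi^TW_z(Y - \Psi\hat\beta_+). \qquad (5.7)$$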


We ap ply t 


Section 


4.3 


re multiplier bootstrap to estimate the distribution of sup zg x 
Given the variance estimator a 2 = n^ 1 Y^=i0^ ~ — J2j=2 


Vnh{fi(z) - fi{z)) as in 
Ek=i *TA*<) 2 ^d the 
variance of the Gaussian multiplier process $\hat\sigma_n^2(z)$
multiplier process 


Hn(^) 


V nh 


n 


= n _1 0j¥W^ T 0*, 

aKhjXg - z)*Jd z 
°n{z) 


we consider the Gaussian 
(5.8) 


where $\xi_1,\ldots,\xi_n$ are independent $N(0, 1)$ random variables. Let $c_n(\alpha)$ be the $(1-\alpha)$th quantile of $\sup_{z\in\mathcal{X}}H_n(z)$. Similar to (4.12), we can construct the confidence band at level $100\times(1-\alpha)\%$: $C_{n,\alpha}^b = \{C_{n,\alpha}^b(z) \mid z \in \mathcal{X}\}$, where

$$C_{n,\alpha}^b(z) := \hat f_1^u(z) \pm c_n(\alpha)(nh)^{-1/2}\hat\sigma_n(z). \qquad (5.9)$$


Recall that $\Sigma_z = \mathbb{E}[K_h(X_1 - z)\Psi_{1\cdot}\Psi_{1\cdot}^T]$ is the population version of $\hat\Sigma_z$. In order to establish valid theoretical results on the confidence band $C_{n,\alpha}^b$, in addition to the restricted eigenvalue condition (A3'), we impose the following stronger assumption on $\Sigma_z$.


(H) For any $z \in \mathcal{X}$, the matrix $\Sigma_z$ is invertible and the first column of the inverse, $\theta_z = \Sigma_z^{-1}e_1$, satisfies $\sup_{z\in\mathcal{X}}\|\theta_z\|_1 \le C\sqrt{m}$ and $\sup_{z\in\mathcal{X}}e_1^T\theta_z \le C$ for some constant $C$.


Remark 5.2. Since $\theta_z$ is a $((d-1)m+1)$-dimensional vector, Assumption (H) implies that the dependency between $X_1$ and $X_j$ for $j \ge 2$ is weak. For example, if $X_1, X_j$ are pairwise independent for all $j \ge 2$, then $\Sigma_ze_1 = (\mathbb{E}[K_h(X_1 - z)], 0,\ldots,0)^T$, as $\mathbb{E}[K_h(X_1 - z)\psi_{jk}(X_j)] = 0$ for any $j \ge 2$. Therefore, (A3') guarantees that $\Sigma_z$ is invertible and Assumption (H) is satisfied due to

$$\sup_{z\in\mathcal{X}}\|\theta_z\|_1 = \sup_{z\in\mathcal{X}}\big[\mathbb{E}K_h(X_1 - z)\big]^{-1} = \sup_{z\in\mathcal{X}}p_1^{-1}(z) + o(1) \le 2/b.$$


The following proposition gives a sufficient condition for Assumption (H) to be true.

Proposition 5.3. Let $\Sigma_z^{(2,2)} = P\Sigma_zP^T$ be the sub-matrix of $\Sigma_z$ where $P = \sum_{2\le k\le(d-1)m+1}e_ke_k^T$ and $e_k$ is the $k$-th canonical basis in $\mathbb{R}^{(d-1)m+1}$. Suppose there exists a $\rho_{\min} > 0$ such that

$$\inf_{z\in\mathcal{X}}\;\inf_{\beta\in\mathbb{R}^{(d-1)m}}\frac{\beta^T\Sigma_z^{(2,2)}\beta}{\|\beta\|_2^2} \ge \frac{\rho_{\min}}{m}. \qquad (5.10)$$

Let $p_{1,j}(x_1, x_j)$ be the joint density function of $(X_1, X_j)$ and $p_j(x_j)$ be the density of $X_j$ for any $j \in [d]$. Suppose there exists a constant $M$ such that

$$\sup_{(x_1,\ldots,x_d)\in\mathcal{X}^d}\;\sum_{j=2}^d\left|\frac{p_{1,j}(x_1, x_j)}{p_1(x_1)\,p_j(x_j)} - 1\right| \le \frac{M}{\log d}. \qquad (5.11)$$

Finally, under Assumption (A1), if $\sqrt{m}/\log d = o(1)$, then Assumption (H) is true.

The proof of Proposition 5.3 is shown in Appendix C.3 in the supplementary material. The next lemma provides guidance on the selection of the tuning parameter $\gamma$ in (5.6).
Lemma 5.4. Suppose that Assumptions (A1), (A3') and (H) hold. Let


1 = C 


m log (dm) m 


nh 


+ 


i /nh 


+ rrih/ 


(5.12) 


for a sufficiently large constant $C$. Then the vector $\theta_z = \Sigma_z^{-1}e_1$ is a feasible solution to the optimization program in (5.6) with high probability. In particular, we have

$$\mathbb{P}\big(\|\hat\Sigma_z\theta_z - e_1\|_{2,\infty} \le \gamma\big) \ge 1 - c/d$$

for some constant $c$.

We defer the proof of this lemma to Appendix E.4 in the supplementary material. We are now ready to present the main theorem of this section, which establishes a valid confidence band for a component in the sparse additive model under the identifiability condition (5.1).

Theorem 5.5. We consider the SpAM model in (2.1) with identifiability condition (5.1). Suppose $\varepsilon \sim N(0, \sigma^2)$ and that Assumptions (A1), (A2), (A4), (A3') and (H) hold. If $h \asymp n^{-\delta}$ for $\delta > 1/5$ and $m \asymp n^{\rho}$ for $\rho \in (0, (10\delta - 2)/3)$, there exist constants $c, C > 0$ such that for any $\alpha \in (0, 1)$, the covering probability of $C_{n,\alpha}^b$ in (5.9) is

$$\mathbb{P}\big(f_1 \in C_{n,\alpha}^b\big) \ge 1 - \alpha - Cn^{-c}. \qquad (5.13)$$

In particular, the confidence band $C_{n,\alpha}^b$ is asymptotically honest, that is,

$$\liminf_{n\to\infty}\mathbb{P}\big(f_1 \in C_{n,\alpha}^b\big) \ge 1 - \alpha.$$

For the detailed proof of this theorem, see Appendix B.2 in the supplementary material.


6 Numerical Experiments 

In this section, we study the finite sample properties of confidence bands for the ATLAS model and the sparse additive model under the two identifiability conditions (5.1) and (2.5). The ATLAS model is also applied to a genomic dataset to explore the relationships between different genes.

6.1 Synthetic Data

We consider two kinds of synthetic models. In the first example, we compare the confidence bands based on the asymptotic Gumbel distribution in (4.9) and on the Gaussian multiplier bootstrap method in (4.12) for the ATLAS model. In the second example, the empirical properties of the bootstrap confidence band for the sparse additive model are evaluated under the identifiability condition (5.1). In both examples, we use the quartic (biweight) kernel $K_q(u) = (15/16)\cdot(1 - u^2)^2\,\mathbb{1}(|u| \le 1)$ as the kernel function in (3.5).
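As a quick numerical sanity check, the sketch below verifies that the kernel $(15/16)(1-u^2)^2$ on $[-1, 1]$ integrates to one and has zero first moment, as Assumption (A2) requires, and computes the constant $\int K^2(u)\,du$ that appears in the confidence band formulas, which equals $5/7$ for this kernel. The midpoint-rule integrator and the helper names are illustrative choices made here.

```python
import numpy as np


def k_q(u):
    """K_q(u) = (15/16) (1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def integrate(f, lo=-1.0, hi=1.0, num=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    edges = np.linspace(lo, hi, num + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(f(mid)) * (hi - lo) / num)


mass = integrate(k_q)                            # should be close to 1
first_moment = integrate(lambda u: u * k_q(u))   # should be close to 0
lambda_k = integrate(lambda u: k_q(u) ** 2)      # should be close to 5/7

print(mass, first_moment, lambda_k)
```

Since the kernel vanishes at the boundary points $\pm 1$, the boundary constant $c_1(K)$ in Theorem 4.3 is zero for this choice.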


Example 6.1. We generate data from the following ATLAS model

$$Y_i = a_1f_1(X_{i1}) + \sum_{j=2}^4a_j(X_{i1})f_j(X_{ij}) + \varepsilon_i,$$

where the additive functions are designed as follows

$$f_1(t) = -2\sin(2\pi t), \quad f_2(t) = t^2 - 1/3, \quad f_3(t) = t - 1/2, \quad f_4(t) = e^{-t} + e^{-1} - 1;$$
$$a_1 \in \{0, 1\}, \quad a_2(t) = 2K_q(4t - 1), \quad a_3(t) = 3\cos(2\pi t), \quad a_4(t) = 4.$$

Here the two values of $a_1 \in \{0, 1\}$ correspond to the two scenarios in which the true function is zero and nonzero. The noise $\varepsilon_i \sim N(0, \sigma^2)$ for $i = 1,\ldots,n$ with $\sigma = 1.5$. This ATLAS model is constructed based on the synthetic example in Ravikumar et al. (2009) by adding the $a_j(t)$'s according to Example 2.3. The covariates $X_{ij}$ are independently and identically generated from Uniform[0,1] distributions for $i = 1,\ldots,n$ and $j = 1,\ldots,d$. It can be checked that this model follows the identifiability condition in (2.5). According to the argument in Example 2.3, the true function is $f_1^*(t) = a_1f_1(t)$. We set the dimension of covariates to be $d = 300$ and consider three sample sizes $n \in \{100, 200, 300\}$. In the kernel-sieve hybrid estimator (3.5), we use cubic B-splines with nine evenly distributed knots and $m = 5$. The tuning parameter $\lambda$ and bandwidth $h$ are chosen by cross validation according to the BIC criterion defined as

$$\mathrm{BIC} = \log(\mathrm{RSS}) + \mathrm{df}\cdot\frac{\log n}{n},$$

where RSS is the residual sum of squares and the degrees of freedom are defined as $\mathrm{df} = s\cdot m$, with $s$ being the number of variables selected by the estimator. The confidence bands in (4.9) and in (4.12) are constructed at the 95% confidence level and the quantile estimator $c_n(\alpha)$ in (4.12) is computed by bootstrap with 1,000 repetitions. To measure the coverage probability of the confidence bands, we compute the empirical probability that the confidence bands cover the true function on the first 100 data points based on 500 repetitions. The numerical results are reported in Figure 2 and Table 1.
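The data-generating process of Example 6.1 and the BIC criterion can be sketched as follows. The function names (`simulate_atlas`, `bic`) and the seed handling are illustrative assumptions; the coefficient functions are taken to be $a_2$, $a_3$, $a_4$ for covariates 2 to 4, which is one reading of the subscripts in the example.

```python
import numpy as np


def k_q(u):
    """Kernel K_q(u) = (15/16) (1 - u^2)^2 on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def simulate_atlas(n, d=300, a1=1.0, sigma=1.5, seed=0):
    """Draw (X, Y) from the ATLAS model of Example 6.1.

    Returns the covariates, the responses, and the target a1 * f_1(X_i1).
    Only the first four covariates enter the mean function (requires d >= 4).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, d))
    x1, x2, x3, x4 = x[:, 0], x[:, 1], x[:, 2], x[:, 3]
    f1 = -2.0 * np.sin(2.0 * np.pi * x1)
    f2 = x2**2 - 1.0 / 3.0
    f3 = x3 - 0.5
    f4 = np.exp(-x4) + np.exp(-1.0) - 1.0
    a2 = 2.0 * k_q(4.0 * x1 - 1.0)          # varies with the first covariate
    a3 = 3.0 * np.cos(2.0 * np.pi * x1)
    a4 = 4.0
    mean = a1 * f1 + a2 * f2 + a3 * f3 + a4 * f4
    y = mean + sigma * rng.standard_normal(n)
    return x, y, a1 * f1


def bic(rss, n, s, m):
    """BIC = log(RSS) + df * log(n) / n with df = s * m."""
    return float(np.log(rss) + s * m * np.log(n) / n)


x, y, target = simulate_atlas(n=200)
print(x.shape, y.shape, bic(rss=10.0, n=100, s=3, m=5))
```

Because $a_2(t) = 2K_q(4t - 1)$ vanishes for $t > 1/2$, the second covariate is active only on part of the range of the first covariate, which is the local sparsity the model is designed to capture.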

Example 6.2. We consider the sparse additive model $Y_i = \sum_{j=1}^4f_j(X_{ij}) + \varepsilon_i$, where

$$f_1(t) = 6\big(0.1\sin(2\pi t) + 0.2\cos(2\pi t) + 0.3(\sin(2\pi t))^2 + 0.4(\cos(2\pi t))^3 + 0.5(\sin(2\pi t))^3\big),$$
$$f_2(t) = 3(2t - 1)^2, \quad f_3(t) = 5t, \quad f_4(t) = 4\sin(2\pi t)/(2 - \sin(2\pi t)).$$

This model is considered by Zhang and Lin (2006), Meier, van de Geer and Bühlmann (2009), and Huang et al. (2010). Let $W_1,\ldots,W_d$ and $U$ follow i.i.d. Uniform[0,1] and

$$X_j = \frac{W_j + tU}{1 + t} \quad\text{for } j = 1,\ldots,d.$$

The data samples $X_{1j},\ldots,X_{nj}$ are i.i.d. copies of $X_j$. The correlation between $X_j$ and $X_{j'}$ is therefore $t^2/(1 + t^2)$ for $j \ne j'$. We set $t = 0.3$. The noise terms $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. $N(0, 1.5^2)$. Similar to the previous example, let the dimension $d = 300$ and the sample sizes $n \in \{100, 200, 300\}$. We also use the cubic B-spline basis with nine evenly distributed knots and $m = 5$. The parameter $\gamma$ in (5.6) is set to be


7 = 0.05 


mlog(dm) rn 


nh 


+ 


y/nh 


+ mh‘ 


We again tune $\lambda$ and $h$ through cross validation by minimizing the BIC criterion. We aim to construct the confidence band for $f^*(t) = f_1(t) - \mathbb{E}[f_1(X_1)]$. In the simulation, we use the sample mean $\mathbb{E}_n[f_1(X_1)] := n^{-1}\sum_{i=1}^nf_1(X_{i1})$ to center $f_1$. To test the coverage probability of confidence bands for inactive covariates, we also construct the confidence band for $f_5(t) = 0$. The empirical probability that the confidence bands cover the true function on the first 100 data points is computed based on 500 repetitions. The results are summarized in Figure 3 and Table 1.
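The correlated design of Example 6.2 can be sketched as below. The shared uniform variable $U$ induces a common correlation of $t^2/(1+t^2)$ between any two covariates, which is about $0.083$ for $t = 0.3$. The function names and the seed are illustrative assumptions.

```python
import numpy as np


def simulate_spam_covariates(n, d, t=0.3, seed=0):
    """Draw X_j = (W_j + t U) / (1 + t) with W_j, U i.i.d. Uniform[0, 1]."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(size=(n, d))
    u = rng.uniform(size=(n, 1))    # one shared U per observation
    return (w + t * u) / (1.0 + t)


def spam_mean(x):
    """Sum of the four active component functions of Example 6.2."""
    s1 = np.sin(2.0 * np.pi * x[:, 0])
    c1 = np.cos(2.0 * np.pi * x[:, 0])
    f1 = 6.0 * (0.1 * s1 + 0.2 * c1 + 0.3 * s1**2 + 0.4 * c1**3 + 0.5 * s1**3)
    f2 = 3.0 * (2.0 * x[:, 1] - 1.0) ** 2
    f3 = 5.0 * x[:, 2]
    s4 = np.sin(2.0 * np.pi * x[:, 3])
    f4 = 4.0 * s4 / (2.0 - s4)
    return f1 + f2 + f3 + f4


x = simulate_spam_covariates(n=200_000, d=4)
y = spam_mean(x) + 1.5 * np.random.default_rng(1).standard_normal(x.shape[0])
empirical = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
theoretical = 0.3**2 / (1.0 + 0.3**2)
print(empirical, theoretical)
```

Dividing by $1 + t$ keeps every covariate inside $[0, 1]$, so the same B-spline knots can be used for all components.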
                        Scheme 1: a_1 = 1         Scheme 2: a_1 = 0
Sample size             100     200     300       100     200     300
ATLAS    Gumbel         0.45    0.67    0.80      0.36    0.57    0.71
         Bootstrap      0.67    0.83    0.91      0.41    0.79    0.90

                        Scheme 1: f_1             Scheme 2: f_5
Sample size             100     200     300       100     200     300
SpAM     Gumbel         0.43    0.52    0.70      0.53    0.60    0.83
         Bootstrap      0.78    0.82    0.93      0.90    0.93    0.98

Table 1: Comparison of coverage probability for confidence bands at the 95% confidence level for the ATLAS model $Y = a_1f_1(X_1) + \sum_{j=2}^4a_j(X_1)f_j(X_j) + \varepsilon$ and the SpAM model $Y = \sum_{j=1}^4f_j(X_j) + \varepsilon$ with dimension $d = 300$, sample size $n = 100, 200, 300$ and $\varepsilon \sim N(0, 1.5^2)$.


6.2 Real Data 

We consider the real dataset on the relation between genes and riboflavin (vitamin B2) production with Bacillus subtilis. The dataset is provided by DSM (Kaiseraugst, Switzerland) and it is publicly available in Supplementary Section A.1 of Bühlmann et al. (2014). The response variable $Y$ represents the logarithm of the riboflavin production rate. The covariates are the logarithms of gene expression levels with dimension $d = 4{,}088$ and sample size $n = 71$. van de Geer et al. (2014), Bühlmann et al. (2014) and Javanmard and Montanari (2014) use the linear model to find potentially significant genes. van de Geer et al. (2014) find no significant genes, Bühlmann et al. (2014) find the gene YXLD-at, and Javanmard and Montanari (2014) find the two genes YXLD-at and YXLE-at to be significant. In this paper, we use the nonparametric model to test whether the two genes YXLD-at and YXLE-at are significant. We first normalize the covariates onto [0,1] and use the ATLAS model to construct both Gumbel-type and bootstrap confidence bands for the two genes YXLD-at and YXLE-at at the 95% confidence level. The results are illustrated in Figure 4. We can see that both genes have significantly nonzero effects. However, for YXLE-at, zero lies within the confidence band over a larger part of the domain than for YXLD-at. Moreover, the magnitude of the regressed function on YXLE-at is smaller than that on YXLD-at. These observations explain why YXLE-at is less significant than YXLD-at in the previous analyses.

7 Discussion 


In this paper, we consider a novel nonparametric model, ATLAS, which is a generalization of the sparse additive model. ATLAS naturally models high-dimensional nonparametric functions having different sparsity patterns in different local regions of the domain. We consider the kernel-sieve hybrid regression to estimate the unknown function. Since we consider functions in the second order Hölder class, only the Nadaraya-Watson-type kernel estimator is considered. However, it is not hard to generalize the loss function in (3.4) to local polynomial regression


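$$\mathcal{L}_z(\alpha, \beta) = \frac{1}{n}\sum_{i=1}^nK_h(X_{i1} - z)\Big(Y_i - \bar Y - \alpha - \sum_{\ell=1}^p\alpha_\ell\frac{(X_{i1} - z)^\ell}{\ell!} - \sum_{j=2}^d\sum_{k=1}^m\psi_{jk}(X_{ij})\beta_{jk}\Big)^2.$$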
[Figure 2 panels: ATLAS ($a_1 = 1$; $n = 100, 200, 300$) and ATLAS ($a_1 = 0$; $n = 100, 200, 300$). Legend: kernel-sieve, unbiased, Gumbel, bootstrap.]

Figure 2: Kernel-sieve hybrid estimators for the $d = 300$ dimensional ATLAS model $Y = a_1f_1(X_1) + \sum_{j=2}^4a_j(X_1)f_j(X_j) + \varepsilon$, for $n = 100, 200, 300$ and the noise $\varepsilon \sim N(0, 1.5^2)$. The confidence bands at the 95% confidence level cover $f_1^* = a_1f_1$ for $a_1 \in \{0, 1\}$ respectively.

[Figure 3 panels: SpAM ($f_1(t)$; $n = 100, 200, 300$) and SpAM ($f_5(t)$; $n = 100, 200, 300$). Legend: kernel-sieve, unbiased, bootstrap.]

Figure 3: Kernel-sieve hybrid estimators for the $d = 300$ dimensional SpAM model $Y = \sum_{j=1}^4f_j(X_j) + \varepsilon$, for $n = 100, 200, 300$ and the noise $\varepsilon \sim N(0, 1.5^2)$. The confidence bands at the 95% confidence level cover $f_1(t)$ on the first row and $f_5(t) = 0$ on the second row.

[Figure 4 panels: YXLD-at and YXLE-at.]

Figure 4: Kernel-sieve hybrid estimators for the riboflavin dataset using the ATLAS model.

We can apply a similar proof technique to show the statistical rate of the estimator based on the generalized loss in higher order Hölder classes. Corresponding methods to construct confidence bands can also be applied.
8 Proofs of Main Theoretical Results


We collect the proofs of the main theorems in this section. The proofs of Theorem 4.2 and Theorem 4.3 are presented in Section 8.1 and Section 8.2 respectively.


8.1 Proof of the Statistical Rate of Kernel-Sieve Hybrid Estimator 


This section outlines the proof of Theorem 4.2 on the statistical estimation rate of the kernel-sieve hybrid estimator in (3.5). Before presenting the main proof, we list several technical lemmas whose proofs are deferred to Appendix D in the supplementary material.

The following lemma provides the restricted eigenvalue condition on the empirical Hessian matrix of the kernel-sieve hybrid loss in (3.4), which is $\hat\Sigma_z = n^{-1}\Psi^TW_z\Psi$.

Lemma 8.1. Under Assumptions (A1)-(A5), suppose $\beta \in \mathbb{R}^{(d-1)m}$ and $\alpha \in \mathbb{R}$ satisfy the cone restriction

$$\sum_{j\in S^c}\|\beta_j\|_2 \le 3\sum_{j\in S}\|\beta_j\|_2 + 3\sqrt{m}\,|\alpha|$$

for some index set $S \subset [d]$ with cardinality $s$. Denote $\theta = (\alpha, \beta^T)^T$. If $s\sqrt{m^3\log(dm)/(nh)} + sm^2/(nh) = o(1)$, there exists a constant $\rho_{\min}$ such that with high probability,

$$\inf_{z\in\mathcal{X}}\theta^T\hat\Sigma_z\theta \ge \frac{\rho_{\min}}{2m}\|\beta\|_2^2 + \frac{\rho_{\min}}{2}|\alpha|^2.$$


The estimation error for the kernel-sieve hybrid estimator comes from three sources: (1) the noise $\varepsilon$, (2) the approximation error by finite B-spline bases, and (3) the approximation error by $s$ local additive functions to the true function. The following lemma provides the rate for the B-spline approximation error, which further illustrates how the number of B-spline basis functions $m$ influences the rate.

Lemma 8.2. Let S z = (<5i ( z),. d n (z)) T where 5i(z) = J2j=2 ~ ) for i = 1 ,..., 

where f m j;z(') is defined in 3.1 . Under Assumptions (A1)-(A5) there exists a constant C > 0 


such that the following three inequalities hold with probability at least 1 — 1/n, 


1 


sup max — ||Tv;WU<L ||2 < Cy/s-m 

.V J>2 n •' 


-5/2 


1 


m 


sup — |\I , . 1 W 2 <$ Z | < C^fs ■ 
z£X n 

sup -WYSzWl < Csm~\ 
z&x n 


-2 


( 8 . 1 ) 

( 8 . 2 ) 

(8.3) 
Our next lemma bounds the approximation error of charts under the ATLAS model in Definition 2.2. We can see that both the number of bases $m$ and the bandwidth $h$ play a role in the estimation.

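Lemma 8.3. Let $\xi_z = (\xi_1(z),\ldots,\xi_n(z))^T$ and $\zeta_z = (\zeta_1(z),\ldots,\zeta_n(z))^T$, where $\xi_i(z) = f_1(X_{1i}) - f_1(z)$ and $\zeta_i(z) = f(X_{1i},\ldots,X_{di}) - \sum_{j=1}^df_{jz}(X_{ji})$ for $i \in [n]$. Under Assumptions (A1)-(A5) there exists a constant $C > 0$ such that the following three inequalities hold with probability at least $1 - 1/n$,

$$\sup_{z\in\mathcal{X}}\max_{j\in[d]}\frac{1}{n}\|\Psi_{\cdot j}^TW_z(\xi_z + \zeta_z)\|_2 \le C\left(\sqrt{\frac{h\log(dh^{-1})}{n}} + \frac{m^{3/2}\log(dh^{-1})}{n} + \frac{h^2}{\sqrt{m}}\right),$$

$$\sup_{z\in\mathcal{X}}\frac{1}{n}\big|\Psi_{\cdot1}^TW_z(\xi_z + \zeta_z)\big| \le C\big(h^2 + \sqrt{h/n}\big),$$

$$\sup_{z\in\mathcal{X}}\frac{1}{n}\|W_z^{1/2}(\xi_z + \zeta_z)\|_2^2 \le Ch^2.$$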

The following lemma quantifies the statistical error arising from the noise e. 

Lemma 8.4. Let $T_n = \sup_{z\in\mathcal{X}}\max_{j\in[d]}n^{-1}\|\Psi_{\cdot j}^TW_z\varepsilon\|_2$, where $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)^T$. Under Assumptions (A1)-(A5) and if $m(nh)^{-1} = o(1)$, there exists a constant $C > 0$ such that with probability at least $1 - 1/n$,

$$T_n \le C\sqrt{\log(dm^2h^{-2})/(nh)}.$$


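We are now ready to present the main proof of Theorem 4.2.


Proof of Theorem 4.2. We denote $\eta_z = \varepsilon + \delta_z + \xi_z + \zeta_z$ and define the event

$$\mathcal{E} = \Big\{\sup_{z\in\mathcal{X}}\max_{j\ge2}\frac{1}{n}\|\Psi_{\cdot j}^TW_z\eta_z\|_2 \le \lambda\Big\}\cap\Big\{\sup_{z\in\mathcal{X}}\frac{1}{n}\big|\Psi_{\cdot1}^TW_z\eta_z\big| \le \sqrt{m}\,\lambda\Big\}.$$

Using Lemma 8.2, Lemma 8.3 and Lemma 8.4, there exist constants $c, C$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - c/n$ if the tuning parameter satisfies the following inequality

$$\lambda \ge C\left(\sqrt{\frac{\log(dmh^{-1})}{nh}} + \frac{\sqrt{s}}{m^{5/2}} + \sqrt{\frac{h\log(dh^{-1})}{n}} + \frac{m^{3/2}\log(dh^{-1})}{n} + \frac{h^2}{\sqrt{m}}\right). \qquad (8.4)$$

In the rest of this proof, we always condition on the event $\mathcal{E}$.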

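Denote $S_z := \{j \in \{2,\ldots,d\} \mid f_{jz} \ne 0\}$ and $\Delta = \hat\beta_+ - \beta_+$, where $\Delta_1 = \hat\alpha_z - f_1^*(z)$ and $\Delta_j = \hat\beta_j - \beta_j$ for $j \ge 2$. We start by showing that $\Delta$ falls into the cone

$$\mathcal{A}_z := \Big\{\Delta : \sum_{j\in S_z^c}\|\Delta_j\|_2 \le 3\sum_{j\in S_z}\|\Delta_j\|_2 + 3\sqrt{m}\,|\Delta_1|\Big\}.$$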

Since /3+ is a minimizer of the objective function, 

l\\Wl/ 2 (Y - vE^EII 2 - i||WV 2 (F - */3 + )|| 2 + All^lr.a - \\\f3\\i,2 + W^(\$z\ - \fi(z)\) < 0. 
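On the event $\mathcal{E}$, we have the following inequality

$$\sup_{z\in\mathcal{X}}\frac{1}{n}\eta_z^TW_z\Psi\Delta \le \sup_{z\in\mathcal{X}}\max_{j\ge2}\frac{1}{n}\|\Psi_{\cdot j}^TW_z\eta_z\|_2\,\|\Delta_{2:d}\|_{1,2} + \sup_{z\in\mathcal{X}}\frac{1}{n}\big|\Psi_{\cdot1}^TW_z\eta_z\big|\,|\Delta_1| \le \lambda\|\Delta_{2:d}\|_{1,2} + \lambda\sqrt{m}\,|\Delta_1|.$$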

z&x n 
The first inequality is due to Hölder's inequality and the second one follows from the definition of $\mathcal{E}$. Furthermore, we derive the following inequality

^||Wy 2 *A||2 < - A^dlftll - Pill) - Av^(|S,| - \mz)\) 

3 =2 


A, 


d 


< ^\\P-P\\i,2 + *Vrri\az-fi(z )\- A^(||^|| - || (3. 

3 =2 

3A 3A >—. . A ii * i 

<yL H A ill + -yVw| A i| - y 2^ ll A ill- 


3 I 


jeS* 




The last inequality shows that $\Delta \in \mathcal{A}_z$.

Next, we prove the rate of convergence by contradiction. Suppose that for some fixed $t$, which will be specified later, we have

1 


3z E X, -=||Wy 2 ^A||>t. 


n 


(8.5) 


Equation 8.5 implies that there exists some z € X such that 
0 > 


nun 


l -Wl ,2 {y - *3+)lli - yIIW \ /2 (Y - *(3 + ) II 2 + A||3+I| 1j2 - A||/3 + || 1>2 . 


Aeyt z ,||El /2 A||>t n 

— 1/2 —> i/2 

Using the fact that $\mathcal{A}_z$ is a cone, we can replace the constraint $\|\hat\Sigma_z^{1/2}\Delta\|_2 \ge t$ by $\|\hat\Sigma_z^{1/2}\Delta\|_2 = t$ and the above inequality still holds. Combining this with the event $\mathcal{E}$, we have


0 > 

min 


AeA,,||E* /2 A[|=t 

> 

min 


AeAz,||£i /2 A||=t 


n 


n 


From Lemma 8.1 we can bound the R.H.S. by 

277 (A) - K0+) + 77 (/ 3 +) <3 II A jII + 3 ^| Ai| 

ies 

< 3 \/i || A 5 u 5 ll 2 + 3 \/m| Ai| 

< 6^2sm/p m i n ||S~ /2 A|| 2 - 


Combining (8.6) and (8.7), we get a quadratic inequality 

Ism 


0 > tr — 


Pmin 


t. 


( 8 . 6 ) 


(8.7) 


( 8 . 8 ) 

Setting $t = 2\lambda\sqrt{sm/\rho_{\min}}$, we obtain from (8.8) that $0 > t^2 - 2\lambda\sqrt{sm/\rho_{\min}} \cdot t = 0$, which is a contradiction. Therefore, $\sup_{z \in \mathcal{X}} n^{-1/2}\|W_z^{1/2}\Psi\Delta\|_2 \le 2\lambda\sqrt{sm/\rho_{\min}}$. Using the rate for $\lambda$ in (8.4) and $h = o(1)$, we have


-=||WV 2 ^A|| 2 < 


sup 

z&xVn 


log (dmh~ l ) yfs m 3 / 2 log(dh x ) h? 
nh m 5 / 2 n \fm 


(8.9)

Now, using Lemma 8.1 and the fact that $\Delta \in \mathcal{A}_z$, we have that

|| A 2: d||i,2 + \fm\ Ai| < v^llAaclla + \/™|Ai| < 7W(^)l|Wy 2 ^A||2 

for any $z \in \mathcal{X}$, which leads to the following inequality

$$\sup_{z \in \mathcal{X}} \Big\{ \sqrt{m}\,|\hat\alpha_z - f_1^*(z)| + \|\hat\beta - \beta\|_{1,2} \Big\} \le Csm\Big( \sqrt{\frac{\log(dmh^{-1})}{nh}} + \frac{\sqrt{s}}{m^{5/2}} + \frac{m^{3/2}\log(dh^{-1})}{n} + \frac{h^2}{\sqrt{m}} \Big),$$

with probability at least $1 - c/n$. To obtain the best rate on the right-hand side, we choose $h \asymp n^{-1/6}$ and $m \asymp n^{1/6}$ to obtain

$$\sup_{z \in \mathcal{X}} \Big\{ \sqrt{m}\,|\hat\alpha_z - f_1^*(z)| + \sum_{j=2}^{d} \|\hat\beta_j - \beta_j\|_2 \Big\} = O_P\big(\log(dn)\, n^{-1/4}\big).$$
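The choice $h \asymp n^{-1/6}$, $m \asymp n^{1/6}$ can be checked by simple exponent bookkeeping. The sketch below is only an illustration: it assumes the four terms inside the bound are the ones displayed above, treats the sparsity $s$ as fixed, ignores logarithmic factors, and uses the hypothetical helper name `rate_exponents`.

```python
from fractions import Fraction


def rate_exponents(h_exp, m_exp):
    """Exponent of n for each term of m * (...) when h = n**h_exp and m = n**m_exp.

    Logarithmic factors and the sparsity level s are ignored (assumption).
    """
    one = Fraction(1)
    return {
        "m*sqrt(1/(n*h))": m_exp + (-one - h_exp) / 2,
        "m*m^(-5/2)": m_exp - Fraction(5, 2) * m_exp,
        "m*m^(3/2)/n": m_exp + Fraction(3, 2) * m_exp - one,
        "m*h^2/sqrt(m)": m_exp + 2 * h_exp - m_exp / 2,
    }


exps = rate_exponents(Fraction(-1, 6), Fraction(1, 6))
worst = max(exps.values())  # slowest-decaying term determines the rate
for name, value in exps.items():
    print(f"{name}: n^({value})")
```

Under these assumptions three of the four terms balance at the same exponent and the remaining one is of smaller order, which is why this pair of tuning parameters is natural.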


From 


3.8 


, we have ||/ — /|| 2 < p n 2 n Sy/m\ and, when h x n 1 ^ 6 and m x n 1 / 6 , the rate becomes 

II/- /III = Op (V 2/3 log(dra)) . 


This completes the proof. □


8.2 Limiting Distribution of the Kernel-Sieve Hybrid Estimator


This section proves Theorem 4.3, which establishes the asymptotic distribution of the proposed estimator. We first state the following lemma on the de-biased estimator.

Lemma 8.5. Let $v_1 = e_1/p_1(z)$. Under Assumptions (A1)-(A5), with probability at least $1 - 1/n$,

$$\sup_{z \in \mathcal{X}} \|v_1^T \hat\Sigma_z - e_1^T\|_{2,\infty} \le Cm\Big( h^2 + \sqrt{\log(dm)/(nh)} \Big).$$


We defer the proof of this lemma to the end of this section. With Lemma 8.5, we are ready to prove Theorem 4.3.


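Proof of Theorem 4.3. From (4.8), we have the decomposition

$$\hat f_1(z) = \underbrace{[\hat p_1(z)]^{-1} \frac{1}{n}\sum_{i=1}^{n} K_h(X_{i1} - z)\,u_i}_{I_1(z)} + \underbrace{[\hat p_1(z)]^{-1} \frac{1}{n}\sum_{i=1}^{n} K_h(X_{i1} - z)\,\xi_i}_{I_2(z)} + \underbrace{\big(e_1^T - [\hat p_1(z)]^{-1} n^{-1}\mathbf{1}^T W_z \Psi\big)(\hat\beta_+ - \beta_+)}_{I_3(z)}, \quad (8.10)$$

where $\mathbf{1} = (1,\dots,1)^T \in \mathbb{R}^n$, $u_i = \varepsilon_i + f_1^*(X_{i1})$ is the combination of the noise and the marginal influence function, and

$$\xi_i = \sum_{j=2}^{d} \big( f_{jz}(X_{ij}) - f_{mj}(X_{ij}) \big) + f(X_{i1},\dots,X_{id}) - \sum_{j=1}^{d} f_{jz}(X_{ij}) \quad \text{for any } i \in [n]. \quad (8.11)$$

We will study $I_1(z)$, $I_2(z)$ and $I_3(z)$ separately.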

The first step is to bound $\sup_{z \in \mathcal{X}} |I_3(z)|$. Using Hölder's inequality, we have

$$|I_3(z)| \le \|v_1^T \hat\Sigma_z - e_1^T\|_{2,\infty} \, \|\hat\beta_+ - \beta_+\|_{1,2}.$$


Combining Theorem 4.2 with Lemma 8.5 we have 


’( sup \I 3 {z)\ <q n ) < 1-, 

V zex ) n 


where the explicit rate of the supremum of $I_3$ can be bounded by


q n = Cm, h 2 + 


log (rim) 
nh 


■ sm 


log(im/r 1 ) y/s m 3 / 2 log(d/i 3 ) h 2 


= O m 3/2 h 2 + 


log (dm) 


nh 


nh ' m 5 / 2 

rn log (tim/m 1 ) 


n 


h z + m 2 + 


nh 


( 8 . 12 ) 


m 


If/ixn 5 , m x n * 5 for <5 > 1/5, we have sup, gi ^ \/n/i|/ 3 (z)| = op(n ■ 1/5 ) = »p(i). 

The next step is to study $I_2(z)$. From the proof of Lemma 8.5 (see equation (8.17)), we have $\inf_{z \in \mathcal{X}} \hat p_1(z) \ge b/2$ with high probability. From Lemma 8.2 and Lemma 8.3, we have

$$P\Big( \sup_{z \in \mathcal{X}} |I_2(z)| \le c\big(h^2 + \sqrt{h/n} + m^{-2}\big) \Big) \ge 1 - \frac{1}{n}. \quad (8.13)$$


Again, if h x n s , m x n * 5 for <5 > 1/5, we have sup zg ,*- \fnh\I 2 i z )\ = o P (l). 
Finally, we study the asymptotic distribution of

$$I_1(z) = \frac{\sum_{i=1}^{n} K_h(X_{i1} - z)\,u_i}{\sum_{i=1}^{n} K_h(X_{i1} - z)}. \quad (8.14)$$

Note that (8.14) is the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964) for the model $U_i = f_1^*(X_{i1}) + \varepsilon_i$. From Härdle (1989), under Assumptions (A1)-(A5) and (KR), we have that


( 2 hlogn ) 1/2 {sup(nh) 1 / 2 r( 2 ;) \h(z) - ff(z)\ - d r 
Vzex 


< u 


+ exp (— 2 exp(— u )). (8.15) 


Since $\sup_{z \in \mathcal{X}} \sqrt{nh}\,|I_2(z)| = o_P(1)$ and $\sup_{z \in \mathcal{X}} \sqrt{nh}\,|I_3(z)| = o_P(1)$, the proof follows from (8.10) and (8.15).

□
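The leading term $I_1(z)$ in (8.14) is a ratio of kernel-weighted sums. A minimal sketch of that ratio is given below; the Epanechnikov kernel, the toy data and the function names are assumptions made for illustration, not the authors' implementation.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(z, x, u, h, kernel=epanechnikov):
    """Ratio sum_i K_h(x_i - z) u_i / sum_i K_h(x_i - z), with K_h(t) = K(t/h)/h."""
    weights = [kernel((xi - z) / h) / h for xi in x]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observation falls within one bandwidth of z")
    return sum(w * ui for w, ui in zip(weights, u)) / total


# Toy data (hypothetical): noiseless linear response on an equally spaced design.
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
u = [2.0 * xi for xi in x]
estimate = nadaraya_watson(0.25, x, u, h=0.2)
print(estimate)
```

With a symmetric design around $z$ and a linear response, the weighted average reproduces the response at $z$ up to rounding, which is a convenient sanity check for the weights.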


Having proved the main theorem, we now come back to give a proof of Lemma 8.5 


Proof of Lemma 8.5 We have the decomposition 


1 


|vf % z - ef || 2 ,oo = Ti \/ max sup - Yf K h (X u - z)& 
v i > 2 n ^ 




(8.16) 


where T\ = sup z& x n 1 £” =1 [Kh{Xi 1 — z) — We also introduce a random variable Tjk for 

j > 2 and k G [m]: 

1 n 

T jk = sup - V K h (Xn - z)ifjk(Xij). 

zexn^ 

1=1 
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Using Theorem 2 in 


Louani 


1998 


Therefore, using Assumption (Al), with high probability, we have 

,-i 1 


sup 

z£X 


v! S z ei - 1 


, there exists a constant C such that P(Ti > C(nh ) 1 / 2 ) < 1 — 1 jn. 

Cb 


sup (pi(z)) 1 -y2(K h (X il -z)-pi{z)) < —j=. (8.17) 

ze,v n Vnh 


To bound $T_{jk}$, we define $\Delta\phi_{jk}(z) = E[\phi_k(X_j) \mid X_1 = z] - \bar\phi_{jk}(z)$ and write

$$T_{jk} = \sup_{z \in \mathcal{X}} \frac{1}{n}\sum_{i=1}^{n} K_h(X_{i1} - z)\,\Delta\phi_{jk}(z).$$

Notice that $\Delta\phi_{jk}(z)$ is the estimation error of the Nadaraya-Watson kernel regression estimator defined in (3.3), which can be studied in a similar way as in the proof of Theorem 4.2. In particular, there exist constants $c, C$ such that with probability at least $1 - c/n$,

$$\sup_{j \ge 2,\, k \in [m]} \sup_{z \in \mathcal{X}} |\Delta\phi_{jk}(z)| \le C\Big( h^2 + \sqrt{\log(dm)/(nh)} \Big).$$


Combining with 8.17 , we can bound 


1 -y . 

max Tjk < max sup A6jk{z) ■ sup — > KAXn 
j>2,k£[m] i>2,fce[m] z& x zex n 


- z 


< C ^ h 2 + y/log (dm 


,, „ Cb ^ 
Pl “ + vs)' 


Therefore, with high probability 

_ n 

Kh(Xn — z) i Sfji < m ■ max Tjk = O (m(h 2 + -^/log (dm) / (nh))) . (8.18) 

T - ^ 2 i>2,fcg[m] V / 

, we have 


max sup 

i > 2 ze* 


2 — 1 


Combining 


1.17 


and 

8.18 

with 

8.16 


f\ Sz - ef || 2 ,oo = Op [m{h 2 + i/log (dm)/(nh))^ , 


which completes the proof. 


□

A Accelerated Algorithm

This section presents the detailed method to accelerate Algorithm 1. To estimate the marginal influence function $f_1^*$, we need to compute the estimator $\hat\alpha_z$ for a number of $z$ values $z \in \{z_1,\dots,z_M\}$. A naive approach is to run Algorithm 1 $M$ times, once for each value of $z$. We provide a more efficient algorithm which significantly reduces the computational cost. From Algorithm 1 and (3.12), the most expensive operation is to evaluate the gradient

$$\nabla_j \mathcal{L}_z(\beta_+^{(t)}) = -\frac{1}{n}\Psi_{\bullet j}^T W_z \big(Y - \Psi\beta_+^{(t)}\big). \quad (A.1)$$







Computing $\nabla_j \mathcal{L}_z(\beta_+^{(t)})$ for a single $z$ requires $O(dm^2 n)$ flops. If we trivially repeat the computation for $M$ different values of $z$, the computational complexity is $O(dm^2 nM)$, which is challenging when $M$ is large. However, we can exploit the structure of $\nabla_j \mathcal{L}_z(\beta_+^{(t)})$ to reduce the computational complexity. According to (A.1) and the fact that $\psi_{jk}(X_{ij}) = \phi_k(X_{ij}) - \bar\phi_{jk}(z)$ in (3.3), the $k$-th coordinate of $\nabla_j \mathcal{L}_z(\beta_+^{(t)})$ has a formulation


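$$\big(\nabla_j \mathcal{L}_z(\beta_+^{(t)})\big)_k = -\frac{1}{n}\sum_{i \in [n]} K_h(X_{i1} - z)\,\phi_k(X_{ij})\,Y_i + \bar\phi_{jk}(z) \cdot \frac{1}{n}\sum_{i \in [n]} K_h(X_{i1} - z)\,Y_i + \sum_{\ell \in [d],\, s \in [m]} \beta_{\ell s; z} \Big\{ \frac{1}{n}\sum_{i=1}^{n} K_h(X_{i1} - z)\,[\phi_k(X_{ij}) - \bar\phi_{jk}(z)]\,[\phi_s(X_{i\ell}) - \bar\phi_{\ell s}(z)] \Big\}. \quad (A.2)$$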


The computation of $\nabla_j \mathcal{L}_z(\beta_+^{(t)})$ is mostly spent on evaluating quantities of the form

$$q(z) = \sum_{i=1}^{n} K_h(X_{i1} - z)\,u_i \quad (A.3)$$

for $z \in \{z_1,\dots,z_M\}$, where $u_1,\dots,u_n$ are fixed quantities independent of $z$ (e.g., $u_i$ could be $Y_i$, $\phi_k(X_{ij})$ or $Y_i\phi_k(X_{ij})$ when evaluating (A.2)). We next introduce a fast method to calculate the general form $q(z)$ and apply it to the computation of (A.2). Without loss of generality, we assume that $z_1 < \dots < z_M$. The naive method of evaluating $\{q(z_\ell)\}_{\ell \in [M]}$ separately for different $z$ has computational complexity $O(nM)$. However, if the kernel function has some special structure, we can reduce the complexity to $O(n + M)$. For example, for the uniform kernel $K(u) = \frac{1}{2}\,1\{|u| \le 1\}$, when we vary the value of $z$ from $z_\ell$ to $z_{\ell+1}$, we just need to subtract $u_i$ for those $i \in \{v : X_{v1} \in [z_\ell - h, z_{\ell+1} - h)\}$ and add $u_i$ for those $i \in \{v : X_{v1} \in (z_\ell + h, z_{\ell+1} + h]\}$. For $M \gg h^{-1}$, the cardinality of $\{i : X_{i1} \in [z_\ell - h, z_{\ell+1} - h] \cup (z_\ell + h, z_{\ell+1} + h]\}$ does not increase with $n$ or $d$. Therefore, the complexity of evaluating $\{q(z_\ell)\}_{\ell \in [M]}$ is reduced to $O(n + M)$. For the Epanechnikov kernel $K(u) = (3/4) \cdot (1 - u^2)\,1\{|u| \le 1\}$, suppose $q(z_\ell)$ is known and define $I_\ell = \{i : X_{i1} \in (z_\ell - h, z_{\ell+1} - h] \cup (z_\ell + h, z_{\ell+1} + h]\}$. We have $q(z_{\ell+1}) = q(z_\ell) + \Delta q(z_\ell)$, where

Xq(zg) = q(zg +1 ) - q{zg) = j Y ( X “ (X a /h) 2 ) Ui + ^ Y X H + XT Yl 


i&h 


Ui. 




iele 


Similar to the argument for the case of the uniform kernel, we also have $|I_\ell| = O(1)$ if $M \gg h^{-1}$. The computational complexity of $\sum_{i \in I_\ell} u_i$ and the other two summations above for $z = z_1,\dots,z_M$ is $O(n + M)$, and hence the computational complexity of $\{q(z_\ell)\}_{\ell \in [M]}$ for the Epanechnikov kernel is also $O(n + M)$. We can also apply a similar trick to many other kernels. Now we turn back to the calculation of the gradient $\nabla_j \mathcal{L}_z(\beta_+^{(t)})$. Let $\hat p_1(z) = n^{-1}\sum_{i=1}^{n} K_h(X_{i1} - z)$,


y£\z) = ~Y Kh ( Xil - z ^ Y k\ z ) = - E K h( x n - z WkiXrfYi, 

i= 1 2=1 

■t n 1 n 

yf\ z ) = ~Y1 Kh ^ Xil ~ z )M x ij) and R ka (z) = ~Y Kh ( Xil ~ z )MXij)MXiu). 
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According to the expansion in A.2 , we can write the k- th coordinate of V jC z (f3^f}) as 


V;4(tf)), = -Yf>(z) + $ jk (z)Y£ 


( 1 )/ 


,( 2 ) 


« + - 
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E (3es;z (Rkv(z) - <t>jk(z)YW (z)j 


- E 

n ■' 

/e[d],sG[m] 


£S[rf],sS[m] 

r (3) 


A? s;2 ( (j>is(z)Y i £ Rz) - <t>jk{z)<t>i 8 {z)p\(z 


Based on the previous discussion on the calculation of $q(z)$ in (A.3), we note that it takes $O(n + M)$ operations to evaluate $\hat p_1(z)$, $Y^{(1)}(z)$, $Y^{(2)}_k(z)$, $Y^{(3)}_k(z)$ and $R_{ks}(z)$ for $M$ different values of $z$. Therefore, the computational complexity of each iteration in Algorithm 1 can be reduced from $O(dm^2 nM)$ to $O(dm^2(n + M))$.
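The sweep behind the $O(n + M)$ evaluation of $q(z)$ in (A.3) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the covariate values are sorted first (an extra $O(n \log n)$ step), and the Epanechnikov case is handled by keeping running sums of $u_i$, $X_{i1}u_i$ and $X_{i1}^2 u_i$ over the active window, which is one way to realize the three summations mentioned above.

```python
import random


def fast_kernel_sums(x, u, grid, h, kernel="uniform"):
    """Evaluate q(z) = sum_i K_h(x_i - z) u_i for every z in an increasing grid.

    A two-pointer sweep over the sorted x keeps running sums over the window
    [z - h, z + h], so each observation enters and leaves the window once.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    us = [u[i] for i in order]
    lo = hi = 0              # active window is xs[lo:hi]
    s0 = s1 = s2 = 0.0       # running sums of u, x*u and x^2*u
    out = []
    for z in grid:           # grid is assumed sorted in increasing order
        while hi < len(xs) and xs[hi] <= z + h:   # points entering on the right
            s0 += us[hi]
            s1 += xs[hi] * us[hi]
            s2 += xs[hi] * xs[hi] * us[hi]
            hi += 1
        while lo < hi and xs[lo] < z - h:         # points leaving on the left
            s0 -= us[lo]
            s1 -= xs[lo] * us[lo]
            s2 -= xs[lo] * xs[lo] * us[lo]
            lo += 1
        if kernel == "uniform":
            out.append(0.5 * s0 / h)
        else:  # Epanechnikov: expand (1 - ((x - z)/h)^2) in powers of x
            out.append(0.75 / h * (s0 - (s2 - 2.0 * z * s1 + z * z * s0) / (h * h)))
    return out


def naive_kernel_sums(x, u, grid, h, kernel="uniform"):
    """Direct O(n M) evaluation, used only as a reference."""
    def weight(xi, z):
        if not (z - h <= xi <= z + h):
            return 0.0
        t = (xi - z) / h
        return (0.5 if kernel == "uniform" else 0.75 * (1.0 - t * t)) / h
    return [sum(weight(xi, z) * ui for xi, ui in zip(x, u)) for z in grid]


rng = random.Random(0)
x = [rng.random() for _ in range(200)]
u = [rng.gauss(0.0, 1.0) for _ in range(200)]
grid = [k / 50 for k in range(51)]
fast_uniform = fast_kernel_sums(x, u, grid, 0.1, "uniform")
fast_epan = fast_kernel_sums(x, u, grid, 0.1, "epanechnikov")
```

Running sums accumulate rounding error over a long sweep, so in practice one might recompute them from scratch every so often; that refinement is omitted here.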

Algorithm 1 can also be applied to minimize the objective function in (3.9) for SpAM estimation. Following a similar analysis, we can also show that the computational complexity of each iteration is $O(dm^2 n)$. Therefore, in the case $M = O(n)$, we can estimate $f_1^*$ using the introduced procedure with the same computational complexity as solving (3.9). Since most existing algorithms for the group Lasso involve evaluating the gradient (Yuan and Lin, 2007; Friedman et al., 2010; Farrell, 2013; Qin et al., 2013), the above argument is applicable to other solvers as well.

B Covering Properties of the Bootstrap Confidence Bands 

In this section, we prove the theorems on the coverage probabilities for the Gaussian multiplier 
bootstrap confidence bands a in 


4.12 


and C h n a in 


5.9 


B.1 Proof of Theorem 4.4

We first prove that C^ a in (4.12| is honest. The strategy to prove the result is to establish a sequence 


of processes from G n (z) that approximate Z n (z). Recall that 

1 A „ £iK h (Xn - z) 


G n (z) = — V& 
\/n 


Z n (z) = 


i=1 o\J h 1 HK)pi(z)’ 
Vnh(fi(z) ~ fi(z)) 


a 


\A ( K )/Mz) 


where p x (z) = \ Ya=i K h( x n - z) and a = ± Ya= 1 \ Y if( x n, ■■■, x , 


some additional notation and technical lemmas. Let 




(B.l) 

(B.2) 

We first introduce 


r n ■ = 


s 2 log(dmh~ 1 ) s 3//2 slog(dh 3 ) 


nm 


- 2 L 
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nm 


- 5/2 


+ syfmh 2 . 


(B.3) 


The following lemma quantifies the estimation rate of the denominators of B.l 


and 


B.2 


Lemma B.l. Let the estimator for Var(e) = a 2 be o 2 = - Under Assumption (Al) and 

(A4), there exists constants C such that P(|f7 2 — cr 2 \ > Cr n y/m) < 6/n and 


P( sup 


a 


VMz) 


a 


y/pi(* 


- 1 


< C (r n y/m + \/log 
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> 1 - 
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To approximate the process Z n (z) by G n (z), we need an intermediate process 


1 n 

G n (z) = -=J2 


^*■^"(1^*1 — ^n)-^0i(-^Ql z) 


a 


V h 1 \(I<)p 1 (z) 


— E 


£il(\ei\ <6 n )^(X!-z) 


0-y / /TT\(K)p^) 


(B.4) 


The next two lemmas show how well $\tilde G_n(z)$ approximates $Z_n(z)$ and $\hat G_n(z)$. We denote $\hat W_n = \sup_{z \in \mathcal{X}} \hat G_n(z)$, $\tilde W_n = \sup_{z \in \mathcal{X}} \tilde G_n(z)$ and $W_n^Z = \sup_{z \in \mathcal{X}} Z_n(z)$.


Lemma B.2. Under the conditions of Theorem 4.4, there exist constants $c, C$ such that

$$P\big( |\hat W_n - \tilde W_n| > Cn^{-1/10} \big) \le \frac{c}{n^{1/10}}. \quad (B.5)$$

Lemma B.3. Under the conditions of Theorem 4.4, there exist constants $c, C$ such that

$$P\big( |W_n^Z - \tilde W_n| > Cn^{-1/10} \big) \le \frac{c}{n^{1/10}}. \quad (B.6)$$


The proofs of Lemma B.1, Lemma B.2 and Lemma B.3 are deferred to Section E. We use Lemma B.1, Lemma B.2 and Lemma B.3 together with Corollary 3.1 in Chernozhukov et al. (2014a) to establish Theorem 4.4. Corollary 3.1 of Chernozhukov et al. (2014a) provides sufficient conditions for the confidence band to be asymptotically honest. Specifically, we verify the following high-level conditions:

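H1 There exists a Gaussian process $G_n(z)$ and a sequence of random variables $W_n^0$ such that $W_n^0 = \sup_{z \in \mathcal{X}} G_n(z)$. Furthermore, $E[\sup_{z \in \mathcal{X}} G_n(z)] \le C\sqrt{\log n}$ and

$$P\big( |W_n^Z - W_n^0| > \epsilon_{1n} \big) \le \delta_{1n}$$

for some $\epsilon_{1n}$ and $\delta_{1n}$.

H2 For any $\epsilon > 0$, the anti-concentration inequality

$$\sup_{x \in \mathbb{R}} P\Big( \Big| \sup_{z \in \mathcal{X}} |G_n(z)| - x \Big| \le \epsilon \Big) \le C\epsilon\sqrt{\log n}$$

holds.

H3 Let $c_n(\alpha)$ be the $(1 - \alpha)$-quantile of $W_n^Z$ and $\hat c_n(\alpha)$ be the $(1 - \alpha)$-quantile of $\hat W_n$. There exist $\tau_n$, $\epsilon_{2n}$ and $\delta_{2n}$ such that

$$P\big( \hat c_n(\alpha) < c_n(\alpha + \tau_n) - \epsilon_{2n} \big) \le \delta_{2n} \quad \text{and} \quad P\big( \hat c_n(\alpha) > c_n(\alpha - \tau_n) + \epsilon_{2n} \big) \le \delta_{2n}.$$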


H4 There exist $\epsilon_{3n}$ and $\delta_{3n}$ such that

$$P\Big( \sup_{z \in \mathcal{X}} \Big| \frac{\hat\sigma\sqrt{p_1(z)}}{\sigma\sqrt{\hat p_1(z)}} - 1 \Big| > \epsilon_{3n} \Big) \le \delta_{3n}.$$

If the high-level conditions H1 - H4 are verified, Corollary 3.1 in Chernozhukov et al. (2014a) implies that

$$P\big( f_1^* \in \mathcal{C}_{n,\alpha} \big) \ge 1 - \alpha - (\epsilon_{1n} + \epsilon_{2n} + \epsilon_{3n} + \delta_{1n} + \delta_{2n} + \delta_{3n}).$$

In the remaining part of the proof, we show that the conditions are satisfied.
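For intuition about the quantile $\hat c_n(\alpha)$ appearing in H3, the following sketch simulates the supremum of a Gaussian multiplier process over a grid and reads off an empirical quantile. It is an illustration under stated assumptions, not the paper's procedure: the process is taken to be $n^{-1/2}\sum_i \xi_i \hat\varepsilon_i K_h(X_{i1} - z) / (\hat\sigma\sqrt{h^{-1}\lambda(K)\hat p_1(z)})$ with i.i.d. standard normal $\xi_i$, the kernel is Epanechnikov with $\lambda(K) = \int K^2 = 3/5$, the supremum is approximated on a finite grid, and all names and the toy data are hypothetical.

```python
import math
import random


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def bootstrap_sup_quantile(x, resid, grid, h, alpha, n_boot=200, seed=0):
    """Empirical (1 - alpha)-quantile of the grid supremum of a multiplier process."""
    rng = random.Random(seed)
    n = len(x)
    lam_k = 0.6  # integral of K^2 for the Epanechnikov kernel
    sigma_hat = math.sqrt(sum(r * r for r in resid) / n)
    rows = []  # rows[g][i] = coefficient of the multiplier xi_i at grid point g
    for z in grid:
        w = [epanechnikov((xi - z) / h) / h for xi in x]
        p_hat = sum(w) / n  # kernel density estimate at z
        if p_hat > 0.0:
            scale = sigma_hat * math.sqrt(n * lam_k * p_hat / h)
            rows.append([wi * ri / scale for wi, ri in zip(w, resid)])
    sups = []
    for _ in range(n_boot):
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sups.append(max(sum(c * g for c, g in zip(row, xi)) for row in rows))
    sups.sort()
    k = min(n_boot, max(1, math.ceil((1.0 - alpha) * n_boot)))
    return sups[k - 1]


data_rng = random.Random(1)
x = [data_rng.random() for _ in range(100)]
resid = [data_rng.gauss(0.0, 1.0) for _ in range(100)]  # stand-in residuals
grid = [0.1 + 0.08 * k for k in range(11)]
q05 = bootstrap_sup_quantile(x, resid, grid, 0.2, alpha=0.05)
q50 = bootstrap_sup_quantile(x, resid, grid, 0.2, alpha=0.50)
```

Fixing the seed makes the draws reproducible, so quantiles at different levels are computed from the same simulated suprema and are ordered accordingly.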

First, we verify the condition H1. We start by constructing a good approximation to the intermediate random variable $\tilde W_n$. Using Theorem A.1 in Chernozhukov et al. (2014a), there exists a Gaussian process $G_n(z)$ and $W_n^0 = \sup_{z \in \mathcal{X}} G_n(z)$ such that $W_n^0$ approximates $\tilde W_n$ at a good enough rate as long as the uniform covering number of the function class

$$\mathcal{G}_K = \Big\{ g_z(X, \varepsilon) = \varepsilon K_h(X - z) \big/ \sqrt{\sigma^2 h^{-1}\lambda(K)\,p_1(z)} \;\Big|\; z \in \mathcal{X} \Big\}$$

is polynomial in the covering radius. By the definition of $\tilde G_n(z)$, we can rewrite it as

$$\tilde G_n(z) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Big\{ g_z\big(X_{i1}, \varepsilon_i 1(|\varepsilon_i| \le b_n)\big) - E\big[ g_z\big(X_{i1}, \varepsilon_i 1(|\varepsilon_i| \le b_n)\big) \big] \Big\},$$

where $g_z(X, \varepsilon) \in \mathcal{G}_K$. The envelope of $\mathcal{G}_K$ can be found as

eK h (X-z) 


\9z(X,e) | = 


A /o- 2 /i- 1 A(/t)fhO) 


< h-WbnWKWoo^XWb)- 1 ' 2 =: Cgh-^WoW, 


for any X G X and e < b n . Applying Lemma F.2 we have 


sup Al (Qk, L 2 (Q),C g h 1/2 v / log n ■ t) < 
Q 


21| K || tv A 


where the supremum is over all measures Q. Theorem A.l in Chernozhukov et al. 
implies that we have the approximation error 

n\W n - <I > e' ln ) < 8[ n , where 


2014a 


An = C 


h 1 / 2 log 3//2 n h 1 / 4 log n h 1 / 6 log 5 / 6 n _■ C' 

T C- tt~a -T G- TTiTri - and 5m = 
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9/20 
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1/4 
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2/15 
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(B.7) 

then 

(B.8) 

(B.9) 


Lemma B.3 gives us the rate of difference as follows 

P(| W n - W% | > Cn~ 1/10 ) < cn~ 1/10 

and using the triangle inequality we get the ultimate difference we desired 

P (|- W°| > e ln ) < 5 ln , 

where £± n = e' ln + CVi -1 / 10 and 5\ n = 5m + cn^ 1 / 10 . If h x n~ s and m x n 5 for 5 > 1/5, we have 
£1 n = o(n -1 / 10 ). Finally, by Dudley’s inequality, we have E[sup zg ^G n ] < Cy/\ogn, which verifies 
the condition HI. 

The condition H2 follows from H1 and the anti-concentration inequality in Corollary 2.1 of Chernozhukov et al. (2014a).

Next, we verify the condition H3. Recall that $\hat c_n(\alpha)$ is the $(1 - \alpha)$-quantile of $\hat W_n$. From the two approximation results in (B.5) and (B.6), we have

$$P\big( |\hat W_n - W_n^Z| > 2Cn^{-1/10} \big) \le 2cn^{-1/10}. \quad (B.10)$$

Therefore, we can bound the probability 

P(Wf < c n , a + 2Cn~ l ' w ) > P(W* < c n , a ) - P(| W n - | > 2Cn~ 1 ' 10 ) 

> 1 — a — 2 c/n 1 / 10 , 

which implies that the estimated quantile has the following lower bound 

c n (a ) > c n (a + 2C'n _1//10 ) — 2cn _1//1 °. (B.ll) 

Similarly, we also have c n (a) < c n (a — 2Cn~ 1 / 10 ) + 2 cn -1 / 10 . By setting r n = 2 Cn -1 / 10 , £ 2 n = 
2 cn “ 1 / 10 and S' 2 n = 2 cn -1 / 10 , we have 

P (E(a) > c n (a + 2Cn~ 1 / 10 ) — 2 cn _1//1 ° and c n (a ) < c n (a — 2C'n _1//10 ) + 2cn~ 1//10 ^ < 2 c/n 1 / 10 , 
which verifies the conditi on H 3. 

Finally, using Lemma B.1, the condition H4 is satisfied for $\epsilon_{3n} = C(r_n\sqrt{m} + \sqrt{\log n/(nh)}) = o(n^{-1/5})$ and $\delta_{3n} = 8/n$. Since the high-level conditions H1 - H4 are verified, the result follows from Corollary 3.1 in Chernozhukov et al. (2014a).


B.2 Proof of Theorem 5.5


The strategy for proving Theorem 5.5 is similar to that used in the proof of Theorem 4.4. We find a sequence of Gaussian processes to approximate the multiplier Gaussian process $\hat{\mathbb{H}}_n(z)$. Similar to the approximation from (B.1) to (B.2), we consider the following four Gaussian processes
aK h {Xn - z)*TO z 




H«(z) = 


Vnh^ i=1 

1 




Vnh^ i=l 


ti 


&n(z) 

oK h {X lX - z)*T0 z 
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&n(z) 

K h (Xg - z)*To e 
Vnh,- 1 & n (z) 

Z n (z) = y/nh ■ a~ 1 (z) (ji(z) - fi(z)j . 


The roadmap is to establish that the process in 


the chain Z„ 


iK 1 ) 


B.15 

is c 

ose to the proces 

s in 

B.12 


l; similar to Lemma 


B.2 


check conditions HI — H3 as in the proof of Theorem 4.4 Since we do not use the population 


and Lemma 


B.3 


(B.12) 
(B.13) 

(B.14) 

(B.15) 

following 
After that, we can 


a n {z) = K[a n (z)] in the intermediate processes, we do not need to check the condition H4. 

In order to verify the condition HI, we first bound the difference between sup 2 g ^ ^n{ z ) and 
sup zeX Z n (z). We begin by considering two auxiliary processes 

<(*) = -TJXf E £ i K h(Xn - z)*To z and Z' n (z) = V^h (/ftz) - fi(zj) . 


i= 1 
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Notice that the above processes are un-normalized version of B.13 and B.15 , that is, H ' n (z) = 
a n (z)M n (z) and Z' n (z) = a n (z)Z n (z). The following lemma provides a direct bound for the difference 
between El ' n (z) and Z' n {z). 


-S 


Lemma B.4. Suppose that Assumptions (Al), (A2), (A3’), (A4), (A5), and (H) hold. If h x n 
for 5 > 1/5 and m x n p for p £ (0, (10<5 — 2)/3), there exists a constant Co > 5/2 such that with 
probability 1 — c/n, 


sup 

zeA” 


- Z' n {z) 


< Cn~ c °. 


We defer the proof of the lemma to Section E.6 and proceed to prove Theorem 5.5 We also 


need to quantify the range of in the following lemma. 

Lemma B.5. Let X z = . If mh = o(l),h = n~ s for <5 > 1/5, there exist constants 

c, C such that for sufficiently large n, with probability 1 — c/n, for any zgl 1 , 

ch~ 1 e[O z < e JX' Z 0 Z < Ch~ 1 ej0 z . 

and Assumption (H), we 


E.7 


We defer the proof of this lemma to Section 
have an upper bound of the inverse of <t/(z) = O'. Zl z Q z as 

sup Vh ■ °n l ( z ) < c - 

z&X 


From Lemma 


B.5 


(B.16) 


With Lemma 


B.4 


and Lemma 


B.5 


we are ready t o bou nd the difference between sup ze ^ H n (z) and 
Let h x n~ 5 for 5 > 1/5 and m x n p for 

From Lemma 


B.4 


sup 2GA . Z n (z). Let co be the constant in Lemma 
p £ (0, (1 0d — 2 )/3). We denote c = Co — 5/2 and observe that c > 0 by Lemma 
and 


B.4 


B.16 


, we have 


B.4 


(sup 2e * 

H„(z) - Z n (z) 

> Cn c ) < P fsup zeA . 

M' n (z) - Z' n {z) 

> Cd n (z)n c °/V~hj 



< P (sup zeAf 

K(z) - Z' n (z ) 

> C 2 n _c °^ < 1 jn. 


Define V® = sup zeA . H n (z) and = sup zeA . M n (z). Since sup zgiY H n (z) is a Gaussian process 
conditional on {A/i}j e r n i, we verify HI by 


(\v° -v z \> < P ^sup |H n (z) - Z n (z)\ > Cn 


1 

< -. 
n 


(B.17) 


The condition H2, as in the proof of Theorem 4.4 follows from HI and the anti-concentration 
inequality in Corollary 2.1 of 1 


Chernozhukov et al. 


2014a 


5.1 


Next, we check H3 by bounding the difference between 
H n (z) by Bin ( 2 ). Since the estimation rate in Theorem 
results of Lemma 
h 


B.l 


apply to the estimated standard c 


B.12 


and 


B.13 


We first approxi mate 


is the same as the rate in Theorem 


4.2 


n 5 for 5 > 1/5, with probability 1 — 6/n, 


cr — a \ < 


r n m 1/ = o 


eviation a. In particular, if mh = o(l) and 

("~ 1/10 )' 
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where r n is defined in B.3 . We denote V n = sup ze ^ H n (z), = sup zg ^ and the difference 

between V n — V^\ Let AH^ 1 ^) = Eli 1 ^) — H n (z). We have 

sup IAH^ (z) I <\S — cr\ sup \Fh ■ cr” 1 (z) ( sup h{z) + sup h(z) ) , 
z&X zeX \zeX zeX J 

where h(z) = n~ l YJi=\ K h( x i 1 - z)| '&J.(G Z - G z ) | and I 2 {z) = n~ l YJi=\ K h( x n - z ) \^J.G Z |. 

In order to bound Ii(z), we first state a technical lemma that characterizes the estimation error 
between 6 Z and 6 Z . 


Lemma B.6. Let G z be a minimizer of 


5.6 


For any z € X, we have 


G Z — G Z J Y1 Z ^G Z — G Z J < 211^1^1151, — cr|| max + 2||S 3 0 Z — ei||2 i0 o||^z||i- 

Moreover, suppose that Assumptions ( Al), (A2), (A3’), (A4), (A5), and (H) hold. If the 
parameter 7 in the optimization program 5.6 is chosen as in [5.12 , then with probability 1 — c/d , 


sup( G z - G z ) t Yi z (G z - G z ) < Cy/rn 
zex 


m log(dm) m 2 

- 7 - 1 - + rnh 

nh yj nh 


(B.18) 


We defer the proof of this lemma to Section E.5 Using Lemma B .6 we bound I\(z). Applying 
Cauchy-Schwarz inequality, we have 


sup | 

z&X 


/1 ^ \ 2 \ V 2 /1 n \ V 2 

\h(z)\ < sup K h(Xii - z) [*l(G z -G 2 )) J K h{Xn - z) 


< Cy/rn 


rn log (dm) m , 2 

- 7 -h —j= + mh 

nh y/ nh 


(B.19) 


where the last inequality is due to Lemma B .6 and 


sup n 1 y ^K h (Xg - z) = sup pi(z) + o(l). 


zex 


i= 1 




For I 2 (z), we have the following inequality 

1 


sup \h{z)\ < sup -||^ W 2 l|| 2 ,oo||0 2 ||l 
z£X z£X n 

< sup -iiwy 2 ^. i ^.wy 2 || 2 ||wy 2 i|| 2 || 0 2 || 1 

z&x n 

(j _ _ 

< —= ■ sup Jpiz) • yfm = 0 ( 1 ). 

V m zex 


(B.20) 


Therefore, combining 


B.19 


and 


B.20 


we have 


(|L - vm 


> C y/r n m 


i l ' A \ < n~\ 
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When h x n s 


for 6 > 1/5 and m x n p for p E (0, (105 — 2)/3), there exists a constant c such that 


r n m - 1 / 4 = 0(n c ). Since cr£j = Ei, we also have sup 


B.17 


z&X 




z) = sup zgiY H n (z). Combining with 


, we have 


V n - VJ 


> Cn c ) < 2 n 


,-i 


Therefore, H3 can be checked using the same argument as in 


B.10 


and B.ll 


. Now, since we have 


4.4 


checked the high-level conditions HI - H3, similar to the argument used in the proof of Theorem 
we have 

P(/iGCtj>l-a-Cn- c , 
which completes the proof of the theorem. 


C Properties of Model Assumptions 

In this section, we discuss the identifiability condition for the ATLAS model and give proofs of the propositions justifying certain assumptions.


C.1 Identifiability Condition of ATLAS


In this section, we show that under the identifiability condition (2.5) the charts of the ATLAS model can be uniquely determined. Specifically, we have the following proposition.

Proposition C.1. If $\{f_{jz}\}_{j \in [d]}$ and $\{g_{jz}\}_{j \in [d]}$ satisfy $\sum_{j=1}^{d} f_{jz}(x_j) = \sum_{j=1}^{d} g_{jz}(x_j)$ for any $z \in \mathcal{X}$, $x_1 \in (z - \delta_0, z + \delta_0)$ and $(x_2,\dots,x_d) \in \mathcal{X}^{d-1}$, then under the identifiability condition (2.5) we have $f_{jz} = g_{jz}$ for any $j \in [d]$ and $z \in \mathcal{X}$.

Proof. Since $\sum_{j=1}^{d} f_{jz}(x_j) = \sum_{j=1}^{d} g_{jz}(x_j)$, set $x_j = X_j$ for all $j \in [d]$ and take the conditional expectation $E[\,\cdot \mid X_1 = z]$ on both sides. This shows that $f_{1z} = g_{1z}$. Next, we fix the value of $x_j = 0$ for all $j \ge 3$ and write $f_{2z}(x_2) = g_{2z}(x_2) + \Delta f(z)$, where $\Delta f(z) = \sum_{j=3}^{d} (g_{jz}(0) - f_{jz}(0))$. Taking $E[\,\cdot \mid X_1 = z]$ on both sides again, we have $\Delta f(z) = 0$ and $f_{2z} = g_{2z}$ for any $z \in \mathcal{X}$. Continuing the argument for the other components, we complete the proof. □


C.2 Proof of Proposition 4.1

Let J be arbitrary subset of [d] and for any (a, (3) = (a,/3f,... ,/3j) T E C^'\j), where C^(J) is 
we consider the functions h\(x\) = a and hj(xj) = Y1 T=i Pjk'fjkixj), for j = 2,..., d. 


4.1 


dehned 

From the cone restriction in (4.1), we have 


Eiiftihs- E WPjh + Ky/m\a 
1 ieJjyi 
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and the B-spline property \\hj\\'i x m i ||/3||2 in 3.8 , there exists a constant ci > 0 such that 


m 

— M 

ci 


^2\\ h jh< E n^'ii 2 + 

1 &j\{ i} 

<k E \\PjW2 + + c^ 2 ) y/rn\a\ 


oeJ\{ 1} 

< ((V5I+1) K + c- 1/2 ) 


m • ^E 11 /ij112• 

jeJ 


Let c be the smallest constant satisfying ck > (^/ci + l) K + cq 1//2 . Therefore, we have (hi,..., hd) G 

Cf } (j). Consider the L 2 norm || ■ \\l 2 {^, z ) induced by the measure t u z (f) = K[Kh(Xi — z)g(-)] for 
any bounded measurable g. For any j > 2, let 'P\.j(x \, xj) be the joint density of (X\,Xj). Under 
Assumption (Al), for any measurable g and any z 6 A, we have 


\\g\\l 2 ^ z) =E[K h (X 1 -z)g 2 (X j )} 

= K h (x - z)g 2 (y)pij(x,y)dxdy 

Jx 2 

b f b b 

-ml, K ^ x ~ z)g 2 {y)Pi{x)Pj{.y)dxdy = j^E[g 2 (Xj)\ = -^\\g\\l 


B 2 Jx ^ 
Therefore, for any zGl, 


£ 2 ' 


d i 11 2 — ^2 ,ck(U) 


/3!s,/3 + >^2||E“=A||2> 


f?2 II^J=1 J 112 — £>2 

Applying the cone restriction on (hi,..., hd), we further have 

d 


(E 3E J^I| 2 ) 2 . (c.i) 


E"'* 

o'eJ 


dll 2 ^ 


1 


(CK + l) 2 


' d 

E»'*. 

> j =i 


'i 11 2 ^ 


1 


Eli'*. 


(CK + 1) 2 ^»'^"2- (ck + 1)2 

\ / o—l X 7 


Cm 


-l 


E ii^'ii 2 + 

. J= 2 


raa 


Combining the above inequality with C.I , we obtain that for any z € X, 

/3+5^/3+ > 


^ ^ Cbm~ l pr 2 


2 ,CK 


sB 2 (ck + l) 2 


(||/3||| + ma 2 ) , for any /3+ G ci h) (J). 


This completes the proof. 


C.3 Proof of Proposition 5.3

We separate U z into four blocks such that 


S 2 



h2,l )T 

‘Z 

r( 2 - 2 ) 














where S^ 1 ' 1 ) € R, si 2,1) € and sl 2 ’ 2) g R(d-i)n*x(d-i)m. jY om 

(Al), both [S 
we have 

®< M) e< 2 - 1 ) T 


5.10 


and Assumption 

(Mb-i anc j exist for any z G X. By the inversion formula of a block matrix, 


S7 1 = 


v ei 2 ’ 1 * ei 2 ’ 2) ) ' 

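where the concrete formulations of these four submatrices are

$$\Theta_z^{(1,1)} = \Big( \Sigma_z^{(1,1)} - [\Sigma_z^{(2,1)}]^T [\Sigma_z^{(2,2)}]^{-1} \Sigma_z^{(2,1)} \Big)^{-1},$$

$$\Theta_z^{(2,1)} = -\Theta_z^{(1,1)} [\Sigma_z^{(2,2)}]^{-1} \Sigma_z^{(2,1)},$$

$$\Theta_z^{(2,2)} = [\Sigma_z^{(2,2)}]^{-1} - \Theta_z^{(2,1)} [\Sigma_z^{(2,1)}]^T [\Sigma_z^{(2,2)}]^{-1}.$$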
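The block inversion formula used here can be checked numerically on a small example. The sketch below assumes a scalar top-left block (as in the text, where $\Sigma_z^{(1,1)} \in \mathbb{R}$), a lower-right block of size 2 by 2 purely for illustration, and made-up numeric values; the function names are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def block_inverse(s11, s21, s22):
    """Inverse of [[s11, s21^T], [s21, s22]] via the Schur complement of s22."""
    s22_inv = inv2(s22)
    # v = s22^{-1} s21 (column vector)
    v = [s22_inv[r][0] * s21[0] + s22_inv[r][1] * s21[1] for r in range(2)]
    # Theta11 = (s11 - s21^T s22^{-1} s21)^{-1}
    t11 = 1.0 / (s11 - (s21[0] * v[0] + s21[1] * v[1]))
    # Theta21 = -Theta11 * s22^{-1} s21
    t21 = [-t11 * v[r] for r in range(2)]
    # w = s21^T s22^{-1} (row vector)
    w = [s21[0] * s22_inv[0][c] + s21[1] * s22_inv[1][c] for c in range(2)]
    # Theta22 = s22^{-1} - Theta21 s21^T s22^{-1}
    t22 = [[s22_inv[r][c] - t21[r] * w[c] for c in range(2)] for r in range(2)]
    return [
        [t11, t21[0], t21[1]],
        [t21[0], t22[0][0], t22[0][1]],
        [t21[1], t22[1][0], t22[1][1]],
    ]


s11, s21, s22 = 2.0, [0.5, -0.3], [[1.5, 0.2], [0.2, 1.0]]
sigma = [
    [s11, s21[0], s21[1]],
    [s21[0], s22[0][0], s22[0][1]],
    [s21[1], s22[1][0], s22[1][1]],
]
theta = block_inverse(s11, s21, s22)
product = [
    [sum(sigma[i][k] * theta[k][j] for k in range(3)) for j in range(3)]
    for i in range(3)
]
```

Multiplying the assembled inverse by the original matrix should return the identity up to rounding, which confirms that the three displayed blocks fit together.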

Denote Aij(xi ,Xj) = \pij(xi,Xj) — p\{x\)pj{xj)\ for any j > 2. Using Assumption (Al) and 
|5.1l), we have the following upper bound 


d m 


EE sup Aij(xi ,Xj) < 


sup 


E 


j = 2 k = l x i eX ( xi ,..., x d ) eX d j=2 

-1 


Plj(xi,Xj) 


Pi(xi)Pj{xi, 


-1 


\ Pl ( xi ) Pj ( Xj )\ 


(C.2) 


< B Z M log -1 d 


In order to bound 9 Z = (©i 1,1 ), ©i 2,1 ) T ) T , we first bound the £\ norm of the second part 

d m 

ii^-’i-EE |E [KhiX-L - z)^ jk (Xj)}\. 

j =2 k =1 

By the triangle inequality, we can bound the norm by 

d m 


K h (x i - z^jk^x^piix^pjix^dxidxj 


ism, < EE 

j =2 k =1 

d m 

+ EEI / K h (x i - z)\i/>jk(xj)\A(xi,Xj)dxidxj\ 

j =2 fc=i 

d m 

sEE EE sup Ai.j(x'i , Xj ) 

3=2 fc=l 

< |A|B 2 Af log -1 d, 


d m 


j =2 fc=l 


where the last inequality is due to E[V’jfc(Aj)] = 0 and C.2 . Hence, we have 

Si 2 ’ 1 ^^ 2 ’ 2 )]- 1 ^ 2 ’ 1 ) < ||S( 2 ’%p^ n m < \X\ 2 B 4 M 2 p^ n mlog~ 2 d 


and we can also bound 


[si 2 ’ 2 )]- 1 ^ 2 ’ 1 )! 


i< 


I si 2 ’ 1 ) I 


p 1 


m < 


min — 


\ x \B 2 MpJ m\og 1 d. 
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Therefore, as long as y^mlog 1 d = 0(1) and = K[Kh(X\ — 2 )] = p\(z) + o(l), we have 


0 


( 1 , 1 ) 


> p\(z) — o(l) > 0 for any z G X. Combining with 


5.10 


, 1 exists for any z G X and 


sup ||0 z ||i < sup + sup ||G>£ 2 ’ 1) ||i 

z£X z£X z£X 


= sup 

z£X 


- S( 2 ’ 1 )t [E^’ 2) ] _1 S 


-lv(2,l) 


-1 


+ sup 

z£X 


ea.DpP, 2 )]- 1 ^, 1 ) 


z£X 

This completes the proof. 


sup {(pi( 2 ) + 0(1)) 1 + {pi(z) + 0(1)) 1 • 0(y/m)\ < Cy/m. 


D Auxiliary Lemmas for Estimation Rate 


In this section, we give detailed proofs of the technical lemmas stated in Section 8.1. The principal technique used in the proofs of this section is the control of the suprema of empirical processes. In the entire section, we will abuse the notation $\sigma_P^2$ as the variance for certain empirical processes.

D.1 Restricted eigenvalue condition


We provide a proof of Lemma 8.1 in this section. Before stating the main part of the proof, we begin with a technical lemma.

Lemma D.1 (Restricted eigenvalue condition). With probability larger than $1 - c/(dm)$, for any $\theta = (\alpha, \beta^T)^T$ with $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^{(d-1)m}$, and any $z \in \mathcal{X}$, it holds that


e T ^ z e > e T ^ z e - c 


mlog(dm) m 1 

+ - + -F= + fc ll| 0 |ll 2 > 


nh ' nh y/ffh 

where $\|\theta\|_{1,2} = |\alpha| + \|\beta\|_{1,2}$. Moreover, for any $j \in [d]$ there exists a constant $C$ such that


sup 1 U’T.j< Cm. 1 . 
X £xn u 2112 


Proof. To simplify the notation, we use $\psi_{jk}(x)$ to denote $\psi_{jk;z}(x)$ defined in (3.3). The proof strategy is to study the suprema of the entries of $\hat\Sigma_z - \Sigma_z$. We denote the $(u, v)$ entry of $\hat\Sigma_z$ as $\hat\Sigma_z(u, v)$ and similarly for $\Sigma_z$. Let $\mathbb{E}_n$ denote the empirical expectation. We first study the random variable

$$Z_{kk'jj'} = \sup_{z \in \mathcal{X}} (\mathbb{E}_n - \mathbb{E})\big[ K_h(X_1 - z)\,\psi_{jk}(X_j)\,\psi_{j'k'}(X_{j'}) \big].$$

Notice that $Z_{kk'jj'} = \sup_{z \in \mathcal{X}} [\hat\Sigma_z - \Sigma_z](1 + (j - 2)m + k,\, 1 + (j' - 2)m + k')$ for $j, j' \ge 2$ and $k, k' \in [m]$. Using the standard symmetrization argument (see Lemma 2.3.1 in van der Vaart and Wellner, 1996), we have

$$\mathbb{E}[Z_{kk'jj'}] \le 2\,\mathbb{E}\Big[ \sup_{z \in \mathcal{X}} \frac{1}{n}\sum_{i=1}^{n} \xi_i K_h(X_{i1} - z)\,\psi_{jk}(X_{ij})\,\psi_{j'k'}(X_{ij'}) \Big],$$

where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. Rademacher variables independent of the data. In order to bound the right-hand side above, we turn to study the covering number of the space

$$\mathcal{G}_h = \big\{ g_z(x_1, x_2, x_3) = h^{-1}K(h^{-1}(x_1 - z))\,\psi_{jk}(x_2)\,\psi_{j'k'}(x_3) \;\big|\; z \in \mathcal{X},\ x_1, x_2, x_3 \in \mathcal{X} \big\}.$$
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Let J~h = {h 1 K(h 1 (- — z)) \ z € A'J and let ||/L||tv be the total variation of K(-). From Lemma 

/ 2||X|| T vA\ 4 


F.2 we have 


sup N(T h ,L 2 (Q),e) < 
Q 


\ he J 


0 < e < 1, 


where the supremum is taken over all probability measures Q on M. Here we take Q = Pxi = 
n _1 Y2i=i Sx a and let Th be an e/L- cover of Th with respect to Px 1; where L > ||V’jfc||oo for any k. 
We construct an e-cover for Qj x with respect to P n = n _1 ^” =1 ^x iA ,...,x ld as 

Qh = \^h{xi)^jk{x2)^j'k'{xz) | fi G 4} • 

For a function g z = h~ l K(h~ 1 {x\ — z))i/jjk(x2)'ipj'k l (^ 3 ) £ Qhi let 

g z = h~ 1 K(h~ 1 (xi - z))^jk{,X 2 )^j'k' (^ 3 ) 6 Qh 

be the corresponding element in the cover. Here h~ 1 K{h~ l {x\ — z)) E Th, is the corresponding 
element in the cover for h~ 1 K(h~ 1 (x\ — z)) E Th- Then 

n 

II 9z ~ 9zf L2( p n) = n- 1 Y [fa -1 (Kh(Xn - z) - K h (X n - z)) ^(X^^X^)] 2 
1=1 
n 

< n- 1 Y [(h- 1 (K h (Xi 1 -z)- K h (X n - S))] 2 < e 2 

i=l 

and the covering number can be bounded as 




he 




(D.l) 


Observe that all functions in Qh are bounded by U = Ah 1 ||.K’|| 0O and 

> 2 " 


4 := E [(K h (Xt - z) tyjkiXjWj'viXj,))' 

= h~ 2 E [lK 2 {h~ 1 (X 1 - z)) E [(tfkiXjWfaiXj') | Xr]] 
< L 2 m~ 1 h~ 2 'E [K 2 (h-^X 1 - 2 ))] 


= L 2 m 2 /i 1 j K 2 (u)px 1 (z + uh)du < bL 2 m 2 h 1 , 

where the first and last inequalities are due to Assumption (Al). The bound above does not depend 
on the particular choice of z. If m{nh)~ l = o(l), we have na P > CU 2 log ( Ua^ 1 ), and from Lemma 


F.l we have 


^[Zkk'jf] < 2 E 


sup ^K h (Xp z ) 'ip j k (Xjj ) ipj / h/ (Xjj >) 


zex 


< c\ 


log(C 2 m) 
nm 2 h ’ 


(D.2) 


where the constants C\. C 2 are independent of k,k*,j,f. As \Zkk'jj'\ < Ah 1 and a 2 P < Cm 2 h 2 , 
we can apply Lemma F.3 to obtain 

P ^Zkk'jj' > E[Zj kk'jj'] +t\JCm^h -1 + Ah~ 1 E[Zkk , jj'] + 4f 2 /i _1 /3^ < exp(— nt 2 ). (D.3) 
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For t = log d/y/n, there exists a constant C such that 


with probability 1 — 1/d. Combining D.2 


Zkk'jj' < C log dm/Vnm 2 h + C/ (nh ) 

, there exists a constant C such that 


with 


D.3 


max \Zkk'jj '| > ‘^‘[Zkk'jj 1 ] + t\!Cm 2 h 1 + Ah + 4i 2 /t 1 /3) 

< P(" max \Zkk'jj' - > tJCm^h -1 + 4h _1 E[Z fcfc / J j/] + 4f 2 h _1 /3 + 

^ fcjfc'SlmJj ,j’>2 * / 

< (dm) 2 exp (— nt 2 ) . 

Let t = 3y/\og(dm)/n, rijk = 1 + (j — 2)m + and rij/k' = 1 + (j 1 — 2)m + A/) and we obtain that 


sup max 

ze* jj'>2,fc,fc'eM 


[^z ^z] iP'jki ^j'k' ) 


= °^ + 


log (dm) 
nm 2 h 


(D.4) 


Similarly, we define Z^j = sup z6i% -(E n — E) \Ky,{X l \ — z)ijjjk(Xij)\. Following the similar procedure 
as above, we apply Lemma [F.l | to obtain that for some constant C, 

,2 ,_ w bv. (v ( wOA 2 ) lj,—1 


4:=E (KhiX-L-z)^^))* <Cm- v h~\ 


and U < h 1 . which implies the following inequality 


E [Zkj] < 2E 


sup £iK h (X n - z)ipjk(Xij) 
z£X 


< C\ 


log (C 2 m) 
nmh 


(D.5) 


and 


We now turn to study the remaining entries of XI 2 — Xl 2 . Using the same arguments as in D.4 
, we can derive an upper bound on the difference 


D.5 


sup max 

ze* i> 2 


£z(l + U ~ 2 )m + k,l)~ Sz(l + (3 ~ 2 )m + k, 1) 


= 0 "U + 


log (dm) 
nmh 


(D.6) 


From Assumption (A1), the density function of $X_1$, $p_1(x)$, is smooth. Recall that $\hat p_1(z) = n^{-1}\sum_{i=1}^n K_h(X_{i1} - z)$. Applying the supremum norm rate for a kernel density estimator established in Theorem 2.3 of Giné and Guillou (2002), we have $\|\hat p_1 - \mathbb{E}\hat p_1\|_\infty = O_P\big(\sqrt{\log(1/h)/(nh)}\big)$, and therefore we get the rate

$$\sup_{z\in\mathcal{X}}\big|\hat\Sigma_z(1,1) - \Sigma_z(1,1)\big| = \sup_{z\in\mathcal{X}}\big|\hat p_1(z) - \mathbb{E}[\hat p_1(z)]\big| = O_P\Big(\sqrt{\frac{\log(1/h)}{nh}}\Big). \qquad \text{(D.7)}$$

Combining (D.4), (D.6) and (D.7), according to Hölder's inequality, we have for any $z \in \mathcal{X}$,

$\theta^T(\hat\Sigma_z - \Sigma_z)\theta$


— Il^lllll^z — ^z Umax 


< ||0|| 2 2 \ sup max m ’S lz (t,t') ~ S 2 (f, t') 




+ sup 

ZdX 


E 2 (l,l)-S 2 (l,l) 


< 


( Im log(dm) m j\og(l/h)\ 2 

C (V nh + ^h + \—nim) mi 


2 ) 
which completes the proof of the first part of the Lemma.

An upper bound on the remaining $\sup_{z\in\mathcal{X}}\|\cdot\|_2$ term can be obtained in a way similar to the proof of Lemma 6.2 in Zhou et al. (1998). For any $\beta_j = (\beta_1,\ldots,\beta_m)^T$, let $u(x) = \sum_{k=1}^m \beta_k\psi_{jk}(x)$. Let the joint density function of $X_1$ and $X_j$ be $p_{1,j}(x_1, x_j)$. From Assumption (A1), we have for any $z \in \mathcal{X}$,


1 


1 


-(3jE[*. j W z *t j \f3j= I -K 


n 


x\ — z 
h 


u 2 (xj)pij(xi, Xj)dx\dxj 


< 


k =1 


Furthermore, we also have 


1 


/ ,. m 

K(u)du / u 2 (x)dx < Cm 

2 / 


(D.8) 


sup — & E[\E r . 7 -St r ., i | Ai = x]/3j = / u z (xj) 1 ' ±,JK J ' dx\dxj 
xexn J J J pi(x) 


b 

< — 
~ B 


/ ,. m 

K{u)du / u 2 (x)dx < Cm Ed 

' k =1 


(D.9) 


Let P n = n 1 SxnjXij- We write the integration as 


sup / Kh(x i — z)u 2 (a:j)(iP ri = /i + / 2 , where 
zeA” J 


z£X 

.2 


I\ = sup / Kh(x i — z)?i 2 (x'j)dPx 1 ,x- and /2 = sup 


zeAf. 


zeA” 


J K h (xi - z)u 2 (xj)d( P n - Pxi,x 3 - 


Due to 

D.8 

ct al. 

1 

998 


, we have /i = Op(m 1 )11/3^-111 and a similar argument to one in Lemma 6.2 of 
will derive I 2 = o(/i)||/3y |||- This completes the proof. 


Zhou 


□ 


Based on Lemma D.1, the remaining step is to prove Lemma 8.1.

Proof of Lemma 8.1. We can derive the restricted eigenvalue condition on the cone from Lemma D.1, which we apply in the last step. If the cone condition

Y WPih < 3 Y Ill'll 2 + ^Vrn\a\ 

jeS c jes 

is satisfied, by Holder inequality, we have the upper bound 

||/3||i,2 < 4^ \\Pjh + 2,y/m\a\ < k^s\\f3\\ 2 + 3y/m\a\. 

j&S 

With large probability, we have the following inequality 

1 


9 t ’£,0 > 6> T £,0-C 


mlog(dm) m 


nh 


nh y/nh 


+ h 2 W i2 


Pmin | CK | T Pmin \ \ ft \ | 2 / Ttl C 

> Pmin|a| 2 /2 + / Ominm~ 1 ||/3|||/2 


mlog(dm) m 1 , 2 \ /, . , 2 . M _ M 9 \ 

- l - - + —r + -j= + hr (4m a 2 + 4s /3 |) 

nh nh y/nh K 


for any z £ X and sufficiently large n if sy/ m 3 log {dm)/{nh) + sm 2 /(nh) = o(l). 
D.2 Proof of Lemma 8.2

The proof can be separated into two cases: $j = 1$ and $j \ge 2$. For simplicity of notation, we write $\delta_i(z)$ as $\delta_i$ in this proof. We first consider the situation when $j \ge 2$ and prove
Lemma 


8.1 


8.1 


and 


8.3 


. From 


||wy^< wy 2 || 2 /v^ = sup WV.jW^h/V^ < P^m~ 1/2 


sup || T ■ z ^ ^ »i ' » z 11^/ V "• — '-"-m || •] ” z ^ • ; I 

z£X zeX 


with high probability. Therefore 

i|i*r,.w,*ii 2 < sup -iiwy 2 ^<wy 2 i| 2 ||wy 2 5|| 2 < SL . sup 1= nwy 2 <sn 2 . (d.io) 

V m zex V n 


sup n ^ 
ze* n J 


zex n 

To complete the proof, we need a bound on 


1 1 n 
sup —||Wy 2 <5||| = sup - z)S'f. 

. ^-yj 71 ^ v 71 < 


zex n 


Using Equation (20) in Zhou et al. 


1998 


n i=i 

on B-spline, we have 


2 -4 

•• m 


fjziXji) fnj;z(Xji)j < S 

3 =2 

Define the following empirical process 

1 n 

U n {z ) = - E -E[^(^n - *)<*i] • 

we have 


Z=1 


Applying Hoeffding’s inequality Hoeffding 


1963 


sup U n (z) — E 
zex 


sup U n (z) 
zex 


\ i n n h 2 t 2 


(DU) 


(D.12) 


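Using the symmetrization inequality (Lemma 2.3.1, van der Vaart and Wellner, 1996), we have

$$\mathbb{E}\Big[\sup_{z\in\mathcal{X}} U_n(z)\Big] \le 2\,\mathbb{E}\Big[\sup_{z\in\mathcal{X}} \frac{1}{n}\sum_{i=1}^n \xi_i K_h(X_{i1} - z)\,\delta_i^2\Big],$$

where $\{\xi_i\}_{i=1}^n$ are i.i.d. Rademacher variables independent of the data. Let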


G'h = {gz{xi, x 2 ) = h 1 K(h l {x\ — z))5 2 (x 2) \ z € A’, xi G A', X 2 € X d 1 j , 

where 5(x2) = fj( x 2j) ~ fnj( x 2 j )■ Similar to the covering number of Qh in i 

5 2 {x2) < sm -4 for any x' 2 , we have for any measure Q , 


D.l 


since 


sup AT (g'l, L 2 (Q),e) 
Q 


( 2\/g||Ar|| T vA \ 4 

V m?he J 
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Furthermore, a 2 P := E[K) r (Xn — z)8 2 ] 2 = 0{{sm 4 ) 2 h 4 ). Since g < U := Ch 1 (sm 4 ) for any 

F.l 


g € Q'l and m 4 (sn) 1 = o(l), we have ncip > C\U 2 log (C^Vsm 4 f7/cr). By Lemma 

—4 


E 


sup C40) 
zeA' 


< C Sm _ J\og{rn 2 /Vsh). 

\Jnh v 


we have 
(D.13) 


We set t = Cs(m 4 h) 1 ^logn/n in 
at least 1 — 1/n , 


D.12 


and combine it with 


D.13 


rr , . n sm 4 f / 2 /-r\ s^logn/n 

sup[/ n (z) < C —_ \/log I m l /v sn 1 + C-—-. 

zeA' vn/i V ' ' 


to obtain that, with probability 

(D.14) 


Finally, we bound the maximum of the expectation by

sup F.[Kh(Xu - z)8f] = sup [ K h (t - z)dP Xl (t)8 2 (u)dP x \ Xl=t (u) 
z£X zGX j 


< Csm 4 sup / Kh(t — z)dP Xl (t ) < Csm 

zeA” J 


-4 


(D.15) 


Combining 


D.14 


and D.15 , with probability at least 1 — 1/n, we have 


1 1 n 

sup -||W 4/2 (5||2 = sup - V K h (Xn - z)5 2 

. ^ ^ n. n < 


z&x n 


ex n 


i= 1 


< sup U n (z) + sup K[Kh(Xn - z)5i] 

z&X z&X 


< c’-^L '/log(mV^) + + Csm 

~ V^h v m 4 h 

= 0(sm~ 4 ), 


-4 


(D.16) 


where the last equality is due to 2/y/nh 2 = o(l). Therefore, we prove the upper bound in 
Combining (D.16) with (D.10), we can conclude that


5.3 


su p maxi||^ w ^ll 2 < 4= ■ sup ' ||W 4/2 5 || 2 < cj^. 

z&x J > 2 n \Jm 2e ^ yjn 




m 


This gives us the rate in 8.1 . Therefore, we complete the proof by bounding all the three inequality 
8.1 , 8.2 and 8.3 . 

The final step is to prove 8.2 . Recall that T'.i = (1,..., 1) T . For the case j = 1, following the 
proof for 8.3 . According to D.ll , we have |5j| < sm~ 2 for any i G [n]. Let 


We use Hoeffding’s inequality Hoeffding 


1 n 

U' n {z) = ~Y1 Kh ( Xil ~ ^ ~ E l R h ( X il - z)6i]- 

again and obtain 


i=l 


1963 


sup U' n (z ) — E 

zGX 


sup U' n (z) 
zex 


> t ) < exp — C 


nh 2 t 2 


sm 


-4 


(D.17) 


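Applying the symmetrization inequality again, we have

$$\mathbb{E}\Big[\sup_{z\in\mathcal{X}} U'_n(z)\Big] \le 2\,\mathbb{E}\Big[\sup_{z\in\mathcal{X}} \frac{1}{n}\sum_{i=1}^n \xi_i K_h(X_{i1} - z)\,|\delta_i|\Big],$$

where $\{\xi_i\}_{i=1}^n$ are i.i.d. Rademacher variables independent of the data. Let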

Q'h = {g z (xi,x 2 ) = h~ 1 K(h~ 1 (xi - z))S(x 2 ) | z E A, x\ E X, x 2 6 A"* -1 } , 


where S(x 2 ) = | fj( x 2j) ~ fnj{% 2 j)\- Just as the covering number of Q", we also have < 5 ( 0 : 2 ) < 

sm~ 2 for any X 2 , we have for any measure Q, 


sup N \G£,L 2 (Q),e J < 
Q 


2s 1 / 2 ||/\ || T yA 
m 2 he 


4 


The variance of the process a p := — z)<5i] 2 = 0(sm~ A h~ l ). Since g < U := CTi^sm -4 ) 1 / 2 

for any g E Q'^ and m 4 (sn) _1 = o(l), we have na p > C\U 2 log(C , 2S 1 / 4 m~ 1 f7/cj). Applying 
Lemma [F. 11 again , we have 


E 


sup U' n {z) 

.z&X 



\J\og{m / yfsh). 


(D.18) 


We let t = Cy/s(m 2 h) 1 y / logn/n in 
probability at least 1 — 1 /n , 


D.17 


and use it with 


D.18 


Therefore, we achieve with 


sup U' n (z)<C- 
z&x \Jnh 

We again bound the supremum of the expectation


dsm 2 f, 7 7 , rr\ „\/s log n/n 

jwrr og i m l ^) +c M • (D19 > 


supE[A'/ l (An - z)5 1 ] = sup / Kh(t - z)dP Xl (t)5{u)dP x>2 \ Xl=t (u) 
zex zgx J 


< C\fsm 2 sup / Kh(t — z)dP Xl (t) < Cyfsm 

ze* J 


-2 


(D.20) 


Combining 


D.19 


and 


D.20 


with probability at least 1 — 1/n, we have 


1 1 . , 
sup -|^7 iW.< 5 z | < sup V K h (X t i - z)|<5j| 

zgx n zexnf^ 

< sup U' n (z) + supE[iL/j(Xn — z)5i] = 0{y/sm~ 2 ). 

z£X z£X 


Therefore, we prove the upper bound in 


8.2 


which completes the proof of the lemma. 
D.3 Proof of Lemma 8.3

For $j \ge 2$, we bound the two terms $\sup_{z\in\mathcal{X}}\max_{j\ge 2} n^{-1}\|\Psi_{\cdot j}^T W_z \xi_z\|_2$ and $\sup_{z\in\mathcal{X}}\max_{j\ge 2} n^{-1}\|\Psi_{\cdot j}^T W_z \zeta_z\|_2$ separately. To bound the first term, let $\Delta f_z(x) = f_1(x) - f_1(z)$ and let $\Psi_{ij}$ be the $i$th row of $\Psi_{\cdot j}$. We can rewrite the suprema as

1 i n 

sup max -||T'^-W 2 £ 2 1| 2 = max sup sup - K h (X a - z)Af z (X li )v I T'y. (D.21) 

z^x i> 2 n i>2 2eA ^ veB m n 


Let $\mathcal{N}_v = \{v_1,\ldots,v_M\}$ be a 1/2-covering of the unit ball $\mathbb{B}^m = \{v \in \mathbb{R}^m \mid \|v\|_2 \le 1\}$. Observe that for any $v \in \mathbb{B}^m$, there exists $\pi(v) \in \mathcal{N}_v$ such that $\|v - \pi(v)\|_2 \le 1/2$. Therefore we have


1 

sup - V] K h (Xi i - z)Af z (X u )v' 1 T-jj 
veB™ n f-f 

n i n 

< sup - y; K h (Xg - z)Af z (Xu)Vk 'S’ij + sup - ^ - z)A/ z (Xi,)v r ^j 

ke [ M ] n ve|B m n 

1 n i i n 

< sup - V] K h (Xn - z)Af z (Xu)Vk'&ij + o SU P ~ Ff/^Xq - z)Af z (X 1 i)v T '& i j. 

ke[M] n ~{ 1 veB™ n “ 


If we move the second term on the right hand side of the last inequality to the left hand side, we 
obtain the inequality that 


sup 

vSB m n 


- n j n 

-Y^KhiXa - z)Af z {X li ) V T ^ ij < 2 sup - V K h (X u - z)Af z (X li )v%* ij . 
” ke [ M ] n 


i =1 
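The 1/2-covering argument above reduces a supremum over the unit ball to a maximum over finitely many net points at the cost of a factor of 2. The sketch below checks this in dimension 2 for a linear functional $v \mapsto v^T a$; the grid construction of the net and all function names are illustrative assumptions, not taken from the paper.

```python
import math
import random

def half_net(step=0.25):
    """Grid points inside the closed unit disk; with step 0.25 they form a 1/2-net."""
    k = int(1 / step)
    pts = []
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            p = (i * step, j * step)
            if math.hypot(p[0], p[1]) <= 1.0:
                pts.append(p)
    return pts

def sup_over_ball(a):
    """sup of v^T a over the unit ball equals the Euclidean norm of a."""
    return math.hypot(a[0], a[1])

def max_over_net(a, net):
    """Maximum of v_k^T a over the finite net."""
    return max(v[0] * a[0] + v[1] * a[1] for v in net)

def covering_radius(net, trials=2000, seed=0):
    """Largest observed distance from a random point of the disk to the net."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        r, t = math.sqrt(rng.random()), 2.0 * math.pi * rng.random()
        v = (r * math.cos(t), r * math.sin(t))
        worst = max(worst, min(math.hypot(v[0] - p[0], v[1] - p[1]) for p in net))
    return worst
```

For any vector `a`, `sup_over_ball(a) <= 2 * max_over_net(a, half_net())`, which mirrors the inequality obtained by moving the second term to the left-hand side.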


Therefore, in order to bound 


D.21 


we need to study the following empirical process 


V n {z) = max sup -V K h (X u - z)Af z (X li )vl'& ij - E[I< h (X u - z)Af z (X 11 )vl'& ij ] 1 . 

j>2,ke[M] z ex \ n J 

We define the following function class 

Q’h = [g z (x 1 ,X 2 ) = h~ 1 K((x 1 - z)/h)Af z (x\) ^ v kt ipt(xj) j >2,k E [M],z E X j 
^ t=i J 

and, similarly to argument in the covering number of Qh in D.l , we have 


sup N [G'h i L 2 (Q), e) < dM 
Q 


2y / m||/L||TVz4 \ 4 

he J 


From (D.9), we bound the maximum of the expectation by

supE[(v^y) 2 \Xi=x\ < C||u fc || 2 m _1 . 


Furthermore, we can bound the variance by writing the expectation as an integral and applying a Taylor expansion:

$$\begin{aligned}
\sigma_P^2 &:= \mathbb{E}\big[K_h(X_1 - z)\,\Delta f_z(X_1)\,v_k^T\Psi_{ij}\big]^2 \\
&= h^{-2}\int K^2\Big(\frac{x-z}{h}\Big)\big(f_1(x) - f_1(z)\big)^2 p_{X_1}(x)\,\mathbb{E}\big[(v_k^T\Psi_{ij})^2 \mid X_1 = x\big]\,dx \\
&\le C'(mh)^{-1}\int K^2(u)\big(f_1(z + hu) - f_1(z)\big)^2 p_{X_1}(z + hu)\,du \\
&= C'(mh)^{-1}\int K^2(u)\big(f_1'(z)uh + o(uh)\big)^2\big(p_{X_1}(z) + p'_{X_1}(z)uh + o(uh)\big)\,du \\
&= C'm^{-1}[f_1'(z)]^2 p_{X_1}(z)\int u^2K^2(u)\,du\cdot h + o(m^{-1}h) \le Chm^{-1}.
\end{aligned}$$
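The Taylor-expansion step above can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions: it drops the B-spline factor (so the $m^{-1}$ term does not appear), uses the Epanechnikov kernel, a uniform density and $f_1 = \sin$, and its function names are hypothetical. It compares the integral $h^{-2}\int K^2((x-z)/h)(f_1(x)-f_1(z))^2 p(x)\,dx$ with the leading term $[f_1'(z)]^2 p(z)\int u^2K^2(u)\,du\cdot h$.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def second_moment(z, h, f, density, n_grid=4000):
    """Midpoint rule for h^{-2} * int K((x-z)/h)^2 (f(x)-f(z))^2 p(x) dx."""
    total = 0.0
    du = 2.0 / n_grid
    for i in range(n_grid):
        u = -1.0 + (i + 0.5) * du
        x = z + h * u
        total += epanechnikov(u) ** 2 * (f(x) - f(z)) ** 2 * density(x) * du
    return total / h  # the substitution dx = h du turns h^{-2} into h^{-1}

def leading_term(z, h, f_prime, density):
    """[f'(z)]^2 p(z) * int u^2 K(u)^2 du * h; the integral is 3/35 for this kernel."""
    return f_prime(z) ** 2 * density(z) * (3.0 / 35.0) * h
```

With `z = 0.3` and `h = 0.01`, the ratio of `second_moment` to `leading_term` is close to 1, consistent with the variance being of order $h$ in this simplified setting.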

The uniform upper bound of $K_h(x - z)\Delta f_z(x)$ can be studied under two cases: (1) $x$ is outside the support and (2) $x$ is in the support. In particular, we have

• if $x \notin [z - h, z + h]$, then $K_h(x - z)\Delta f_z(x) = 0$;

• if $x \in [z - h, z + h]$, then, by the mean value theorem,

$$K_h(x - z)\Delta f_z(x) \le h^{-1}K(h^{-1}(x - z))\,|f_1(x) - f_1(z)| \le h^{-1}\|K\|_\infty\|f_1'\|_\infty\cdot(2h) \le 4\|f_1'\|_\infty\|K\|_\infty.$$
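The two cases above can be verified on a grid. The following sketch assumes the Epanechnikov kernel (so $\|K\|_\infty = 0.75$) and $f_1 = \sin$ (so $\|f_1'\|_\infty = 1$); both choices and the function names are illustrative assumptions. It confirms that the product vanishes outside $[z-h, z+h]$ and stays below $h^{-1}\|K\|_\infty\|f_1'\|_\infty\cdot(2h)$ for every bandwidth tried.

```python
import math

K_SUP = 0.75  # sup-norm of the Epanechnikov kernel

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def scaled_kernel(x, z, h):
    """K_h(x - z) = h^{-1} K((x - z) / h)."""
    return epanechnikov((x - z) / h) / h

def worst_product(f, z, h, n_grid=2001):
    """Maximum of K_h(x - z) |f(x) - f(z)| over a grid of x in [z - 2h, z + 2h]."""
    worst = 0.0
    for i in range(n_grid):
        x = z - 2.0 * h + 4.0 * h * i / (n_grid - 1)
        worst = max(worst, scaled_kernel(x, z, h) * abs(f(x) - f(z)))
    return worst
```

The bound does not depend on $h$, which is what makes the envelope $U$ in the proof free of the bandwidth.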

Combining with the fact that <_\/m for any i, j, k, we conclude that g < U := 4||/{ Woo^/m, 


for any g e Q ”. Therefore by Lemma F.l and M = 6 ,n , we have 


W„(z) < • log (dMh-') 


ran 


n 


!h\og(dh~ 1 ) ^m 3 ^ 2 log(dh x ) 


n 


n 


(D.22) 


Similar to the analysis of dp, we also expand the expectation of the process as the integration and 
use the Taylor expansion to bound it as follows 


E[K h (X 1 - Z )Af z (X 1 )vl* 


yj 


= h 1 J K (M x ) - fi{z))px 1 (x)dx • E [vfcT'ij \Xi = x] dx 

= J K(u) [f[(z)uh + f1(z)(uh) 2 /2 + o(uh ) 2 ] [p Xl {z) + p' Xl (z)uh + o(uh)\ 
■ ^E j | X\ = z] + uh-^-E [vj^ij | X\ = z] + o(uh)^j du < Ch 2 


(D.23) 


m. 


The last inequality is due to the fact that K(-) is an even function, 


< 1 and from 


D.9 


Moreover, the constant is independent to j, k and z. Using Lemma F.3 we have 

/ /- 2 Ut 2 \ 

P (v n (z) - EKO) > tyj 2(4 + 2UEV n (z)) + — J < exp (- nt 2 ) . (D.24) 
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Combining D.22 
have 


and D.23 


with 


D.24 


for t = \/logn/n, with probability at least 1 — 1/n, we 


1 1 . 71 ^ 
sup max -||^W 2 £ 2 || 2 < 2 max sup - Y] K h {X u - z)Af z (X u )vl& 
zex j> 2 n j> 2 ,fce[M] 2e ^ n “ 

< K(^) + max sup E[iv/ l (Xi - 2 )A/ z (Xi)v^ ij ] 
ze* 




(D.25) 


< C 


/ilog(d/r _1 ) m 3 / 2 log(dh 3 ) h 2 
-b C-b C 


n 


n 


m 


where the last equality holds because $n^{-1}h = o(1)$.

Now we bound $\sup_{z\in\mathcal{X}}\max_{j\ge 2} n^{-1}\|\Psi_{\cdot j}^T W_z\zeta_z\|_2$. The procedure is similar to the first part of the proof. We again apply the 1/2-covering of $\mathbb{B}^m$ so that


1 1 n 

sup max — H^LWzCzlh < 2 max sup -S' K h (Xn - z)Q{z)vl^ 
z&x i > 2 n j>2,fee[M] ze x n 


*r 


Motivated by the above argument, we now turn to study the following empirical process

V„(z) = max sup | - V K h (X u - z)(i(z)v'[^ ij - E [K h (X n - z)Ci(z)v^ij] 1 
j>2,ke[M] ze x [ n J 

and the function class inspired from the above empirical process 

G'h = lgz(x i,x 2 ) = h~ l K{{ Xl - z)/h)Q(z) Y Vkt'Mxj) j >2,k <E [M],z,x 1 ,x j € <t|. 

^ t= l ' 

We bound the supremum of the process in the same way as in the proofs of the previous lemmas. We first study the covering number of the function class. Combining the concentration inequality for the supremum with upper bounds on the expectation of the supremum and on the supremum of the expectation then yields the final bound. We therefore first bound the covering number

supiV (C',T 2 (Q),e) < dM 
Q V 


Using Definition 2.2 there exists L\(z,x\i) such that the approximation error is bounded by 

d 

' < \Li(z,x v )\ • \Xn -z\ + {UjWKXn - zf. 


Uz)\ = 


f(Xa ,... .X id ) - Y fiz(X, 

3 = 1 




Therefore, the variance of the process V/ can be bounded by computing the expectation 

<r% := E [KhiXi - zXiiz^l^ij] 2 < Chm~ l (D.26) 

and g < U := 4||Li(2f, x\ 1 )||^ 0 ||A'|| 0O y / 7B for all g E G'h ■ Lemma 


F.l 


gives us 


E V^z) < C 


h\og{dh 3 ) m 3 / 2 log(d/i x ) 

-b (-• - 

n n 


(D.27) 
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2.2 


Denote k(x) = E\Li(z, Ab^) | X\ = x\. Using Definition 
nmxi - zXiiz^Vij) 

= h- 1 J K (^-^j mi(z,x v ) I Xi = x]( Xl -z) + Uj(z)(x - z ) 2 ) 

• E [v^i,- | X\ = x] Px 1 {x)dx 

= J K(u)(n(z)uh + (k"(z) + ||£/ 7 j| 0O )('uh) 2 /2 + o(uh) 2 )(p Xl (z) + p' Xl (z)uh + o(uh)) 

■ ^E [vjk’M'i j | X\ = z] + uh-^-E | Xi = z] + o(uh)^J du < Ch 2 /\/m. (D.28) 

The last inequality is due to Assumption (Al). Since X is compact, E [v^’Fi j \ X\ = z] and k(z) 


D.26 

, 

D.27 

and 


D.28 


part of the proof, 
least 1 — 1 /d, we can bound the suprema 


F.3 


1 rp 

sup max- \l>.-W 2 Cz 2 < C 
u .r i> 2 n 3 


similar to the first 
can yield that for some constant C, with probability at 

(D.29) 


hlog(dh 4 ) m 3 / 2 log(d/i 3 ) h 2 

-b C-b C — 

n n Wm 


Combining 


D.25 


and 


D.29 


we 


have the rate of sup zgA - maxj> 2 ^||^ , L W 2 (^ Z + £ 


z 2- 


For the case when j = 1, ’J/.i = (1,..., 1) T £ M n and we can follow similar procedure to derive 

1 I n 

sup -||^3 iW z (£ z + C *)|| 2 = SU P ~Y,Kh{z - X a )(&(>) + £(>)) = Op (h 2 + y/hjn) . 
z& x n" n z — J \ J 


zex n 


2—1 


The hnal step is to bound sup 2gi *> ^||W Z £ 2 ||| and sup 2g;t ^||W Z C 2 |||. We just repeat the 

procedure again and consider V”'(z) = sup. gA -n _1 Y%= i K h( x u ~ z)£i(z) - E [I\ h (X u - z)£i(z)]. 
First, we find that 


EV'"(z) < C 


h 3 log(/i l / 2 ) hlog(h 4 ) 

-b C- 

n n 


(D.30) 


Next, we have the upper bound of the supremum of the expectation


sup E[I\h(Xi - z)( 2 (z)\ = h 

z&X 


-l 


K 


x — z 


(/i0) - fi(z)) 2 p Xl (x)dx 


|D.30 

and 

D.31 

with Lemma 

F.3 


= J K(u)(f[(z)uh + o(uh)) 2 (p Xl (z) +p' Xl (z)uh + o(uh))du 
= if'( z )] 2 Px 1 (z) J u 2 K{u)du ■ h 2 + o(h 2 ) < Ch 2 . (D.31) 

with t = lo gn/y/n, with probability at least 1 — 1/n, 


1 

sup - E K h {Xi 1 - z)£ 2 (z) < V n (z) + sup E[K h (Xn - z)£ 2 (z)\ 
z&x n “ ze* 


< C 


/r 3 log(ro/i -1 ) h 5 / 4 log(n/i 4 ) h log 


n 


+ C 


n 3 / 4 
-2(„\ _ „ n2- 


+ + Ch 2 = 0 {h 2 ) . 


n 


Similarly, we also have sup, gi ^ n 1 YHi=i Kh(Xa — z)(f(z) = op(h 2 ). 
D.4 Proof of Lemma 8.4

For j = 2,..., n, we define the process 


Gn (z,k,j) = -Y^-K 

sjn ' h 
i =l 


1 „ (Xn-z 


h 


^jk(Xji)£i. 


Since $\epsilon_i$ are subgaussian random variables, we have $P(\max_i|\epsilon_i| > C\sqrt{\log n}) \le 1/n$. Conditioning on the event $\mathcal{A} = \{\max_i|\epsilon_i| \le C\sqrt{\log n}\}$, we can apply McDiarmid's inequality to obtain

. , . , , , „nh 2 t 2 

max sup G n (z, k,j) \ A 

z^X 


max sup G n (z, k,j ) — IE 

V j’ k z£X 


> t j A ) < exp ( — C 


log 2 n 


(D.32) 
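The truncation event $\mathcal{A}$ used above rests on a union bound for the maximum of subgaussian noise. As a minimal sketch, assume the noise is exactly Gaussian with standard deviation $\sigma$ (an illustrative special case of Assumption (A4)); the function names are hypothetical. The level $t_n = \sigma\sqrt{2\log(2n^2)}$ makes the union bound $2n\exp(-t_n^2/(2\sigma^2))$ equal to $1/n$, and $t_n \le 3\sigma\sqrt{\log n}$ for $n \ge 2$, so a threshold of the form $C\sqrt{\log n}$ suffices.

```python
import math
import random

def truncation_level(n, sigma=1.0):
    """Level t with 2 n exp(-t^2 / (2 sigma^2)) = 1 / n (Gaussian union bound)."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * n * n))

def union_bound(n, t, sigma=1.0):
    """Union bound on P(max_i |eps_i| > t) for n Gaussian variables."""
    return 2.0 * n * math.exp(-t * t / (2.0 * sigma * sigma))

def exceed_frequency(n, reps=500, sigma=1.0, seed=0):
    """Empirical frequency of the event max_i |eps_i| > truncation_level(n)."""
    rng = random.Random(seed)
    t = truncation_level(n, sigma)
    hits = sum(
        max(abs(rng.gauss(0.0, sigma)) for _ in range(n)) > t for _ in range(reps)
    )
    return hits / reps
```

The simulated exceedance frequency is typically far below $1/n$, since the Gaussian tail bound behind the union bound is not tight.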


Next, we bound E[ max,-^ sup^ Ft y G 


2.2.5 in van der Vaart and Wellner 
1 — 1/n, there exists a constant (J 


1996 

such 


[z, k,j) | A] . Using Dudley’s entropy integral (see Corollary 
), conditioning on {Xy} ie [ re ] je[d], we have with probability 
hat 


E 


max max sup G n (z , k, j) I A 

l<j<dl<k<m z£X 


< E 


uo 


log N(g' h , L 2 (P n ), e)de \ A 


where P n = n 1 Ya=i f>x iU ...,x id , <*n = maxi<j< d rnaxi< fc < m sup zeX F n [K h (- - z)^ jk (-)] 2 and 
G'h = {dz(x 1,X 2 ) = h~ 1 K(h~ 1 (xi - z))ipjk(x 2 ) \1 <k <m,z € X,xi,x 2 £ X) . 


From Lemma F.2, and similarly to the previous computation of the covering number, for any measure $Q$ we have the uniform upper bound

$$\sup_{Q} N(\mathcal{G}'_h, L^2(Q), \epsilon) \le dm\Big(\frac{2\|K\|_{TV}A}{h\epsilon}\Big)^4.$$

Following a similar argument as in the proof of Lemma 8.1, we bound the variance of the process by the Cauchy-Schwarz inequality as

4 := E (K h (X 1 - z) jk (. Xj) | A) 2 

< [. K 2 {h~\X 1 - z)) E [tf k (Xj) | X 1 ] | 4 < bWKWKmh)- 1 , 

and g < ||iF||oo^ _1 for any g G Q' h . Since m(nh)~ l = o(l), we have na 2 P > h~ 2 log(dm(2| \K\ |tv4 (her))). 
Therefore, by Lemma F.l. we derive that 

E 


L/o 

and E 


log N(g'L 2 (F n ),e)de 


A 


< CapyJ\og(dm(hap) 4 ) < C 


log (dm 3 h~ 2 ) 


mh 


max sup G n (z,k,j) A < Cy/\og(dmi i h 2 )/(mh). 

o,k z&X 


Choosing t = C log 2 nj(y/nh) in 


D.32 


, we have 


rnaxsup G n (z, k, j) > C 

, 3,k z&X 


log (dm. 3 h 2 ) log 2 n 


mh 


y/nh 


< P max sup G n (z,k,j) > C 
\ ih ^g* 


log (dm 3 h 2 ) log 2 n 


mh 


y/nh 


A + P(Pl c ) < 2/n. 
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Therefore, when m(nh ) 1 = o(l), with probability 1 — 2/n, 


sup max — < </— max max sup G n (z,k,j) 

ze x j> 2 n 3 V n i<j<di<k<m zeX 


When j = 1, recalling that 'T.i = 1) T £ M n , we have 


J. - rri 

sup -||vl>. 1 W z e || 2 = sup 

ze* n ze* 

and, similar to the case when j > 2, we can show that sup 2e ^ n 2 |[T , ^ 1 W s e || 2 < C^log^h- 1 )/(nh) 
with probability 1 — 2/n. This completes the proof. 


1 

^ ' Xh(z 
n ■' 


< 2C 


log (dmh 
nh 


-L 


E Auxiliary Lemmas for Bootstrap Confidence Bands 


In this section, we prove the technical lemmas used in Section B. In Section E.1 to Section E.3 we give proofs of the lemmas in Section B.1 used in the proof of Theorem 4.4. Section E.4 to Section E.7 provide the proofs of the lemmas in Section B.2 supporting the proof of Theorem 5.5.


E.1 Proof of Lemma B.1


We denote the rate r n of the estimated function shown in Theorem 4.2 

-1 i slog(dh~ 1 ) 


as 


r n ■■= 


s 2 log (dmh 


nm 


- 2 h 


+ 



-7 + 


nm 


- 5/2 


+ s\fmh , 


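we first state a lemma on the estimation error of $\epsilon_i$.

Lemma E.1. Let $\hat\epsilon_i = Y_i - \hat f(X_{i1},\ldots,X_{id})$ for $i = 1,\ldots,n$. Under Assumption (A4), we have

$$P\Big(\max_{i\in[n]}|\hat\epsilon_i - \epsilon_i| \le 2C\sqrt{m}\,r_n\Big) \ge 1 - \frac{1}{n}.$$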

If/ixn s , m x n 6 for <5 > 1/5, we have r n y/m. = o(n 1 / 5 ). 


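We defer the proof of the lemma to the end of this subsection. With the rate of $\max_{i\in[n]}|\hat\epsilon_i - \epsilon_i|$, we can first bound the rate of $\hat\sigma^2 - \sigma^2$. Using the triangle inequality, we have

$$|\hat\sigma^2 - \sigma^2| \le \underbrace{\frac{1}{n}\sum_{i=1}^n(\hat\epsilon_i - \epsilon_i)^2}_{\mathrm{I}} + \underbrace{\frac{2}{n}\sum_{i=1}^n\big|(\hat\epsilon_i - \epsilon_i)\epsilon_i\big|}_{\mathrm{II}} + \underbrace{\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i^2 - \sigma^2\Big|}_{\mathrm{III}}. \qquad \text{(E.1)}$$

From Lemma E.1 we have the convergence rate of the noise estimator

$$P\big(\mathrm{I} > 4C^2r_n^2m\big) \le P\Big(\max_{i\in[n]}|\hat\epsilon_i - \epsilon_i|^2 > 4C^2r_n^2m\Big) \le 1/n. \qquad \text{(E.2)}$$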
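The three-term split of the variance estimation error can be checked directly. The sketch below assumes the plug-in estimator $\hat\sigma^2 = n^{-1}\sum_i\hat\epsilon_i^2$ (its definition lies outside this section, so this form is an assumption) and uses a hypothetical function name. It relies on the identity $\hat\epsilon_i^2 = (\hat\epsilon_i-\epsilon_i)^2 + 2(\hat\epsilon_i-\epsilon_i)\epsilon_i + \epsilon_i^2$.

```python
import random

def variance_error_terms(eps, eps_hat, sigma2):
    """Return (|sigma2_hat - sigma2|, I, II, III) for residuals eps_hat and noise eps."""
    n = len(eps)
    # I: mean squared residual estimation error
    term_i = sum((eh - e) ** 2 for e, eh in zip(eps, eps_hat)) / n
    # II: cross term between estimation error and true noise
    term_ii = 2.0 * sum(abs((eh - e) * e) for e, eh in zip(eps, eps_hat)) / n
    # III: deviation of the empirical second moment of the noise
    term_iii = abs(sum(e * e for e in eps) / n - sigma2)
    # assumed plug-in variance estimator
    sigma2_hat = sum(eh * eh for eh in eps_hat) / n
    return abs(sigma2_hat - sigma2), term_i, term_ii, term_iii
```

Because term I is controlled by $\max_i|\hat\epsilon_i-\epsilon_i|^2$, a uniform bound on the residual error immediately bounds it, which is the content of (E.2).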
Under Assumption (A4), $\epsilon_i$ are subgaussian random variables with variance proxy $\sigma^2$. Using Bernstein's inequality, we have


n 

ly > 2 

n 

2—1 


2 -<7 2 


>c /OllognWa and p 

n 1 n 


1 n 

-VN — E|U 

n 


i= 1 


> C 


a 2 log n 


n 


2 

< 

n 


(E.3) 


Suppose n is large enough, so that E|e| < \Ja 2 log n/n. We now can bound the second term by 
P(II > 2c(c+ l)E|e| r n y/m) 

< P ( max I £j — sA > 2 cr n y/rn) + P ( — led > Kiel + c\ - 

\* e M J \ n ~[ V 


' ai log n \ 3 


Applying the fact that E|ej| < a 2 , we have the upper bound of the third term as 

p(m > 2c(c + l)a 2 r n y/rn) < 3 jn. 
with 


(E.4) 


Combining E.2 


E.3 


, E.4 


E.l 


Moreover, using Theorem 2 in 


Louani 


, we have the estimation rate of the variance of noise as 
(| a 2 — a 2 1 > C\r n y/rn ) < 6/n. (E.5) 

, there exists a constant C such that 


1998 


(||pi Pi 11 OO — C\/logn/(nh)^j < l/n. 


Since b < p\(z) < B for any z £ X, we can show that the estimated density function is bounded 
from above with high probability as 


P ( sup pi (z) < 2B J <1/ n, 
\z&X J 


(E.6) 


for sufficiently large n. Let C 2 = 2(a 2 bB x ) l Ci + b l C\. We have 

a 2 pi(z) 


P sup 

V-ze* 

< 


a 2 pi(z) 


- 1 


> C(r n y/m + y/\ogn/(nh)) 


S ir | sup 
>zex 


|pi( 2 )| |u 2 - (J 2 | 2C 1 r ny /m\ f a 2 \pi(z) - pi{z)\ Ci /logn 
-A- 1 > —--— I -I- P I sun--- < 1 /- 


a 2 b > a 2 bB Ll ] +P l^ a 2 b 


nh 


< P ( sup \a 2 — a 2 1 > Cir n y/m \ + P ( sup p(z) < 2Bj 
\z£X J \z£X J 


(E.7) 


+ P sup \pi(z) -Pi(z)\ < Cl 

\zex V nh 


log n\ 8 


< -. 
n 


Applying the inequality that 


o\Jpi( z 


- 1 


into 


E.7 


&y/pi(z 

, we complete the proof of the lemma. 


< 

V 2 Pl(z) 1 


a 2 pi(z) 


Now we come back to prove Lemma E.1.

Proof of Lemma E.1. Recall that the estimator of the true function is

d m 


Similar to Lemma 


Huang et al. 


2010 


8.2 


f(x i, ...,x d ) = fi(xi) + EE kTpjk (Xj ) • 

3=2 k =1 

let Si = X] j= 2 fj(Xji) — fmj(Xji) and the B-spline theory (see Lemma 1, 


From Theorem 


4.2 


max | e * — e% | = max 
iGfnl isfnl 


) that Sf < sm 27 . Define the event 

£ = |sup | \frri\a z - f*(z)\ + ^ ||/% - /3j|| 2 | < CV n j . 

we have P(£) > 1 — 1/n. Conditioning the event £. we have 
f{X il ,...,X id )-f(X il ,...,X id ) 


< max \fi{Xa) - fi{Xn)\ + max ^ ^(Pjk ~ Pjk)^jk{Xij) 


«6h 


t6n 


d m 


j =2 fc=l 


+ max | 


»6n 


< sup 1 / 1 ( 2 ) - fi(z) \ + y/m'Y] \\(3j - (3j 11 2 + \/im 7 < 2 Cy/mr n , 

7^2 

where the second inequality follows from Hölder's inequality together with the fact that $\psi_{jk} \le 1$ for all $j, k$, and the last inequality holds because we are conditioning on $\mathcal{E}$. □


E.2 Proof of Lemma B.2


The strategy of the proof is to approximate the processes from (B.1) to (B.4) by a sequence of processes. To carry out the approximation, let $b_n = C\log n$ and introduce a chain of processes connecting the above three processes as follows:


e S. 1) w = ^E& 


Z ) 


1 °V h l K K )v i(z)’ 


\/n “ 


Z ) 


i— 1 
n 


&\/h 1 \(K)pi(zY 

^ z') 

i=1 °'\/h~ 1 X(K)p 1 (z) 

In addition to the processes above, recall that we also have $G_n(z)$ being the Gaussian process


e l 3) w = 4E6' 


satisfying condition HI in Section B.1. Our proof can be divided into four major steps:


Step 1 
Step 2 
Step 3 
Step 4 


Bounding sup 2g * G n (z) - sup 2g<¥ ^( 2 ), 
Bounding sup 2gA . &n\z) - sup 2gA >G n\z), 
Bounding sup 2gA - G n\z) - sup 2g * Gn 2) ( 2 ), 

.—. /q\ 

Bounding sup 2g * G n ( 2 ) - sup zgA > G„ ( 2 ). 
The method to bound these terms follows the procedure developed in Chernozhukov et al. (2014a).
We define E^[-] = E[-1 {Xij, £i}j e r n ] j e ui] an d similar for P^(-) and Varc(-). Let Wn ; = sup 2gi ^ Gh (z ) 
and recall that W n = sup 2giV G n (z) and W n = sup 2gi ^ G n (z) 

For the Step 1, we first bound the difference between W n — . Define AfiW(z) = Gn' 1 (z) — 

G n (z). For any z,z' G X, the L 2 difference between AG^^(z) and AGW(^) can be bounded as 


E^[AG (1) (z) - AG (1) (V)] S 


< 


n 


E 

i =1 _ 


£iK h (Xn - z) ( a\JPi(z) 
ay/h^X^piiz) \dy/pi(z) 


-1 - 


£iK h (Xn - z) ( a\Jpi{z) 
cr^h- l \(K)pi(z) l Sy/pxiz) 


- 1 


Define the event $\mathcal{A}_1 = \{\max_i|\hat\epsilon_i| \le 2C\sqrt{\log n}\}$. We have $P(\mathcal{A}_1) \ge 1 - 2/n$ for sufficiently large $n$, using $P(\max_i|\epsilon_i| > C\sqrt{\log n}) \le 1/n$ together with Lemma E.1.


We also define 


A 2 = 




a 




- i 


E £3 n 


Similar to the proof of Lemma B.l we have P(^ 2 ) E 8/n. On events Ai n A 2 , we have 

^ € X > C G'k := |a • g z (') 9z £ Gk , |o| E £3n| 


G'k ■■= 

Using 


B.7 


£jK h (Xn - z) [ ay/prjz) _ 
a y /h- 1 X(K)pi(z) yd^/pitz) , 

on Ai nA- 2 , we have for any measure Q, 


sup N(G' K ,L 2 (Q),C g £ 3n h 1/2 y / log?r • 
Q 


< sup N(GK,L 2 (Q),C g £ 3n h 1/2 x /logn • r) < f -liMlE 
Q \ T 


A 


(E.8)

Applying Dudley's inequality (Corollary 2.2.5 in van der Vaart and Wellner, 1996), on $\mathcal{A}_1 \cap \mathcal{A}_2$, there exists a constant $C'_g$ such that for sufficiently large $n$,

Ec 


sup|AG«(E)| 

_zEX 


<Cg£ 3n h 1/2 ^log n\J 5 logiCg^nh^^y/iogri) < C' g £ 3n h 1/2 log 2 n. (E.9) 


Moreover, on A\ n A 2 we can bound the variance as 


1 

sup Var^AG^z)) = sup - ^ 
■ze* zex n 


£iK h (Xn - z) ( ayjpi(z) _ i 
°\fK K )Pi( z ) V J \fW(z) j 


1 2 


(E.10) 


< (C g h 1/2 y/log n£ 3n ) 2 
Using Borel's inequality (Proposition A.2.1 in van der Vaart and Wellner (1996)), we have


5 sup 

KZEX 


AG^(z) — E^ sup AGj, 1 ^) > t ) < exp ( — \t 2 / sup Var(AG^(z))) . (E.ll) 

[zex \ J V 2 1 zEX ) 
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Plugging E.9 


E.10 


into 


E.ll 


and setting t = C„h 1 / 2 logn£ 3 n := o n i, we have 


f^(w n -W^ >2 Cgh 1/2 logne 3n ) < ( sup AG^(z) >2 C g h 1/2 logn£ 3ri ) <-. (E.12) 

v ' / n 

—"—fl') "— -(2) —f2) 

For Step 2, we bound WA ; — WA ; where WA ; = sup ze ^ GA Tz)- We follow a similar argument 
to that in the previous step. Let AG^(z) = G^ 2 ^(z) — G^^z). For any z,z' £ X, on the event 
M 3 = {maXj G r n | | Ei — £i\ < 2 Cr n y/rn) with P(A 3 ) > 1 — 1 /n, we can derive 


E 5 [AG (2) (z) - AG (2) (A)] 2 = -J2 


2=1 




1 2 


°\/K k )pi( z ) 


asJK K )Pi( z ' 


< 4C 2 r 2 m • — 

n r—' 


2=1 


A fe (X a - z) _ JL fe (Wi - A) 

° V x ( k )pi( z ) a V H k )pi( z> ) 


1 2 


Define the function class indexed by the covariate z £ A as 

= {^(W) = 2 Cr n y/mK h (X - z)/y/a 2 h\(K)pi(z) 


z e x 


}• 


Using Lemma F.2 the uniform covering number can be bounded by 


sup N(G%, L\Q), 2 CCgh-^yM ■ r) < 


Q 


21| AT 11 tv A 


Similar to (E.9|, using the Dudley’s inequality, we have 


E, 


.2 ex 


sup |AG^ 2) (z)| < 2 CC g h~ l/2 r n y/rn\/A \og(CC g h 1 / 2 r n y/m) =: Cgh x ^ 2 r n yfrn\ogn (E.13) 


for certain constant C” and for sufficiently large n. Moreover, on A 3 , we have 


sup Var^ 

z£X 


1 


Z = 


sup-J^ 


zGA' n 


2=1 _ 


(£j - £j)K h (Xii - z) 

°\/KK)pi{z) 


- 2 

<(2 CCgh-^rny/rn) 2 . (E.14) 


It follows again from Borel’s inequality that for a n 2 = 2 C''h 1 / 2 log nr n y/m = o(n 1 ^ 10 ), 


WW - > a n2 ) < P e fsup AG ^(z 

y \zGX 

-i(2) i /p(3) _/p(3) ttj> r/p* (^)l 


> a n 2 < 

n 


Next, we approximate GA ; by G^ 0 ; = GA — E[GA q. Define the event 


A 4 — 


\ maxej < b n := C^logn } , 
l *e[n] J 


(E.15) 


for any $z \in \mathcal{X}$. According to the maximal inequality for sub-Gaussian random variables (see Lemma 2.2.2 in van der Vaart and Wellner, 1996), we have $P(\mathcal{A}_4) \ge 1 - 1/n$. Therefore, under $\mathcal{A}_4$, the
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difference AG^ 3 ^) := G n\z) — G^q (z) can be bounded by 


5(2) 


(3) / 


ag( 3 )(2) = ^X>-{ 

l 

n 

i 5 >- E 

/n < 


2—1 


^ bn)Kh(Xn z ) 

a y /h- 1 \(K)pi(z) 

5” bn)k.h(Xn z) 
a \/ h~ l X(K)pi(z) 

1 


- E 


£ i-^(| £ i| ^ b n )Kh{Xn z) 
a y / h- 1 \(K)pi(z) 


< E[£ 2 ] 1 / 2 P(k| > bnf^Cghr 1 ! 2 — VV < cCgn-^h-^—Y^ti. 

Denote a n 3 = aC g rT 1 / 2 h~ l ^ 2 log n = 0{n~ 2/l5 ) and WrP = sup 2giY G, (3) . On A 4 , we have 

> a n3 ) < P 5 (sup Gj?\z) - G$(*) 


(E.16) 


2 ex 


^ ^n3 


1 n \ 

> k, g n I ^ 

2=1 / 


(E.17) 


(3) 


Now we turn to Step 4 comparing the suprema of the conditional multiplier processes W 1 
with the suprema of the Gaussian process W° = sup 2gi y G n (z). Recall that G n (z) is the Gaussian 


process satisfying condition HI. Theorem A.2 in 


Chernozhukov et al. 

2014a 

and 

B.7 


imply 


a n 4 = GtJ log n/n + CTogn • h 3 / 4 / n x / 4 = o(n 1//10 ) such that on A 5 

(|t?( 3 ) - W° | > a n4 ) < 2 /n, 


for sufficiently large n 

Step 5 has been done in the main proof of Theorem 


W n - W° 


4.4 


in Section 


B.l 


> e' ln ^ < 5' ln where e\ n = o(n 1 / 10 ), 5' ln = 0 {n 1 / 10 ) according to 


B.f 


(E.18) 
gives that 


B.9 


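In summary, the chain of approximations in (E.12), (E.15), (E.17) and (E.18) shows that on the event $\mathcal{A} = \mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3 \cap \mathcal{A}_4 \cap \mathcal{A}_5$, we have $P_\xi(|\widehat{W}_n - W_n| > w_n \mid \mathcal{A}) \le Cn^{-1/10}$. Since $P(\mathcal{A}) \ge 1 - C/n$, we have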


W n -W n 


> w n < 


n I ^ 


W n -w n 


> w n \ A) + 


: )< 


C 


n 


1 / 10 ’ 


(E.19) 


where $w_n = \epsilon'_{1n} + a_{n1} + a_{n2} + a_{n3} + a_{n4} = O(n^{-1/10})$. This completes the proof of the lemma.


E.3 Proof of Lemma B.3


We bound the difference between Wjf = sup 2g ^ Z n (z) and W n == sup,^ G n (z). We also introduce 
an intermediate process 




n ^ c r \/^ _ 1 A(A")p 1 (z) 


(E.20) 
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Similar to the proof Lemma B.3 here we bound sup, g ^ G n (z) — sup ze ^ G,',' \z) and sup^ g ^ Z n (z)— 


sup^G^). Define AG n (z) := G n (z) — G^fA). Similar to E.16 , we have 


supE (g n (z) — G^fz)) < E 

z£X ' ' 


b n )Kfo^Xii z) 


a 


\/h 1 X(K)pi(z) 


< aCgh-^n- 1 ' 2 . 


Recall that A 4 = {maxj g u e* < b n := C\/logn} and P (A 4 ) > 1—1/n. For a n 5 = 2aC g h 1//2 n 1 / 2 , 
the tail probability can be bounded as 


sup AG„(z) > a n5 
zex 


= P sup AG n (z) > a n5 I A 4 P(^ 4 ) + P sup AG n (z) > a n5 | A\ P(^) 
\z<=X J \z£X 


(E.21) 


< P ( supE[G n (z) - G®(x)] > a n 5 | A 5 ) + P(^) < - 


Therefore for W n = sup zg ^ G„,(z) and Wr^ = sup zgi ^ G^(z), we have 


Wjp - W n 


Next, we study Z n (z) — Gn^(z). Using 8.10 from the proof of Theorem 


> a n 5 ) < P ( sup AG„(z) > a n5 ) < 1/n. 
. 26 * 


(E.22) 


4.3 


we can write 


WUe.W+A.Wll#!. where A„ W = AG„M + ' /H ' (h{z) + 




y/( 7 2 X(K) Pl (z) 


where 1‘Az) and I%(z) are defined in 8.10 . Using 


8.12 


8.13 


E .6 


and 

for 5 > 1/5 + £q for some £q > 0 , there exists a constant C such that 


~ T VS(I 2 (z) + I 3 (z))p(z) > Cn _« 0 \ < 3 
zex x /a 2 \(K)p 1 (z) J n 


, if /i x n 5 and m x n d 


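Combining with (E.22), we have the convergence rate of $\sup_{z\in\mathcal{X}} \Delta_n(z)$ such that

$$P\Big(\sup_{z\in\mathcal{X}} \Delta_n(z) > u_n\sqrt{\log n}\Big) \le 4/n,$$

where $u_n = n^{-\epsilon_0}/\sqrt{\log n}$ with $u_n \log n = o(1)$. Therefore, we have shown that

$$P\Big(\Big|\sup_{z\in\mathcal{X}} Z_n(z) - \sup_{z\in\mathcal{X}} G_n(z)\Big| > Cn^{-1/10}\Big) \le cn^{-1/10},$$

which completes the proof.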
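The truncation step in the proof above relies on the event $\mathcal{A}_4$ holding with probability at least $1 - 1/n$. As a minimal sketch of why a level of order $\sqrt{\log n}$ suffices, the Python snippet below evaluates the union bound $n \exp(-b_n^2/2)$ under the illustrative assumption of standard Gaussian errors (the proof itself only needs a tail of this type). The function names and the choice $C = 2$ are hypothetical and are not taken from the paper.

```python
import math


def truncation_level(n: int, C: float = 2.0) -> float:
    """Truncation level b_n = C * sqrt(log n) used to define the event A_4."""
    return C * math.sqrt(math.log(n))


def union_tail_bound(n: int, C: float = 2.0) -> float:
    """Union bound on P(max_i eps_i > b_n) for n errors.

    Assumption (illustration only): eps_i ~ N(0, 1), so the Chernoff bound
    P(eps_i > b) <= exp(-b^2 / 2) holds for every b >= 0.
    """
    b = truncation_level(n, C)
    return n * math.exp(-b * b / 2.0)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # With C = 2 the bound equals n * n^{-2} = 1/n up to rounding.
        print(n, union_tail_bound(n), 1.0 / n)
```

With $C = 2$ the bound is $n \cdot n^{-2} = 1/n$, which matches the probability attached to $\mathcal{A}_4$; a larger $C$ only makes the bound smaller.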
E.4 Proof of Lemma 5.4

Recall that 5R = n _ 1 4 , W,'5 fi and X 2 = E[R/j(Xi — By Holder inequality, we have 


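$$\big\|\widehat{\Sigma}_z\theta_z - e_1\big\|_{2,\infty} = \big\|(\widehat{\Sigma}_z - \Sigma_z)\theta_z\big\|_{2,\infty} \le \sqrt{m}\,\big\|\widehat{\Sigma}_z - \Sigma_z\big\|_{\max}\,\|\theta_z\|_1. \tag{E.23}$$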


Following the same argument as in the proof of Lemma D.1, there exists a constant $C$ such that, with probability $1 - c/n$,

$$\sup_{z\in\mathcal{X}} \big\|\widehat{\Sigma}_z - \Sigma_z\big\|_{\max} \le C\left(\sqrt{\frac{\log(dm)}{nmh}} + \frac{1}{nh} + h^2\right). \tag{E.24}$$

Plugging (E.24) and $\sup_{z\in\mathcal{X}} \|\theta_z\|_1 \le C\sqrt{m}$ into (E.23), we therefore obtain the desired upper bound and the proof is complete.


E.5 Proof of Lemma B.6 


Applying the fact that djH Z 0 Z < we have the following inequality 


O z — 6 Z ) Yj z ( 9 Z — 6 Z ) — Ti z 6 z — 2djll Z 6 Z + dj^ z d z 


< 2 o T z %e z - 2 (ej± z - e ?: ) e z - 2efe z 
= 20T (i e z - 2 (G t ± z - e,;l e z 


— 2||^||?||S* — cr||max + 2||S,0 Z — ei || 2 , 00 1|111- 

Plugging (5.12), (E.24) and Assumption (H) into the above inequality, we have the rate in (B.18).

E.6 Proof of Lemma B.4


Similar to 8.10 , we can expand the difference between two processes as 

K(z) - Z' n (z) = Kh(Xi 1 - z)rf^lB z + Vnh^ - %2 Z )0+ - /3+), 


where rf t is defined in 


8.11 


Ti(z) 

Using Theorem 


5.1 


and Lemma 


5.4 


t 2 (z) 


with probability 1 — c/n, we have 


sup |T 2 (z)| < Vnh\\T, z 6 z - ei || 2 ,oo ||3+ - P+h,: 


z&X 


< 


y V nh v nh 


\og(dmh~ l ) yfs + m 3 / 2 log(dh x ) h 2 


n 


nh ' m 5 / 2 

Since $mh = o(1)$ and $h \asymp n^{-\delta}$ for $\delta > 1/5$, we have $\sup_{z\in\mathcal{X}} |T_2(z)| = o_P(n^{-1/10})$.


m 


(E.25) 
To bound $T_1(z)$, we first apply the triangle inequality and the Cauchy-Schwarz inequality to
decompose $T_1(z)$ into three smaller terms


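$$\begin{aligned}
T_1(z) &= \sqrt{h/n}\sum_{i=1}^n K_h(X_{i1} - z)\, r_i \Psi_i^\top(\hat\theta_z - \theta_z) + \sqrt{h/n}\sum_{i=1}^n K_h(X_{i1} - z)\, r_i \Psi_i^\top\theta_z \\
&\le \sqrt{nh}\cdot T_{11}^{1/2}(z)\cdot T_{12}^{1/2}(z) + \sqrt{nh}\cdot T_{13}(z),
\end{aligned}$$

where the three processes $T_{11}$, $T_{12}$ and $T_{13}$ are defined as follows:

$$T_{11}(z) = \frac{1}{n}\sum_{i=1}^n K_h(X_{i1} - z)\big(\Psi_i^\top(\hat\theta_z - \theta_z)\big)^2, \qquad T_{12}(z) = \frac{1}{n}\sum_{i=1}^n K_h(X_{i1} - z)\, r_i^2,$$

$$\text{and}\quad T_{13}(z) = \frac{1}{n}\sum_{i=1}^n K_h(X_{i1} - z)\, r_i \Psi_i^\top\theta_z.$$

From Lemma B.6, we can bound the supremum of $T_{11}(z)$ by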


sup | Tii (z) | = sup [O Z -0 Z ) ( 9~ - 6 Z ) < Cyfrn 

zS* z&X 


rn log (dm) m , 9 , 

-^+ -== + mh 2 . (E.26) 

nh y/^h 1 


Let $\delta$, $\xi_z$ and $\zeta_z$ be as defined in Lemma 8.2 and Lemma 8.3. From those two lemmas, with
probability $1 - c/n$, we have


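$$\sup_{z\in\mathcal{X}} |T_{12}(z)| \le \sup_{z\in\mathcal{X}} \frac{1}{n}\big\|W_z^{1/2}\delta\big\|_2^2 + \sup_{z\in\mathcal{X}} \frac{1}{n}\big\|W_z^{1/2}(\xi_z + \zeta_z)\big\|_2^2 \le C\big(sm^{-4} + h^2\big). \tag{E.27}$$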


Lemma 8.2 and Lemma 8.3 also give us 


1 


sup |Ti 3 (z)| < sup -||^ W z (5 + £ z + Cz) ||2,ooII^zII 1 
zex zex n 


< Cy/rn 


(V^ • m- 5/2 + \l hlo ^ dh ~^ + m 3/2 log(dh !) + 
\ V n n 


h 2 


(E.28) 


Combining (E.26), (E.27) with (E.28), if $h \asymp n^{-\delta}$ for $\delta > 1/5$ and $m \asymp n^{\rho}$ for $0 < \rho < 10(\delta - 1/5)/3$, we have

$$\sup_{z\in\mathcal{X}} |T_1(z)| \le C\sqrt{nh}\, m^{3/4} h^2 = o(n^{-c}).$$

Combining this inequality with the rate of $\sup_{z\in\mathcal{X}} |T_2(z)|$ in (E.25), we have our lemma proved.
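The constraint on $\rho$ in the last step can be checked by exponent bookkeeping. The following is a minimal sketch, assuming the bound has the form $\sqrt{nh}\, m^{3/4} h^2$ with $h = n^{-\delta}$ and $m = n^{\rho}$ exactly (constants and the $\asymp$ relation are ignored); the helper names are hypothetical. It uses exact rational arithmetic to show that the exponent of $n$ is negative precisely when $\rho < 10(\delta - 1/5)/3$.

```python
from fractions import Fraction


def t1_exponent(delta: Fraction, rho: Fraction) -> Fraction:
    """Exponent e with sqrt(n h) * m^(3/4) * h^2 = n^e for h = n^-delta, m = n^rho.

    sqrt(n h) * h^2 = n^(1/2) * h^(5/2) contributes 1/2 - (5/2) * delta,
    and m^(3/4) contributes (3/4) * rho.
    """
    return Fraction(1, 2) - Fraction(5, 2) * delta + Fraction(3, 4) * rho


def rho_upper(delta: Fraction) -> Fraction:
    """Upper limit 10 * (delta - 1/5) / 3 on rho stated in the proof."""
    return Fraction(10, 3) * (delta - Fraction(1, 5))


if __name__ == "__main__":
    delta = Fraction(1, 4)
    for rho in (Fraction(1, 16), Fraction(1, 8), rho_upper(delta)):
        print(f"delta={delta}, rho={rho}, exponent={t1_exponent(delta, rho)}")
```

At the boundary value of $\rho$ the exponent is exactly zero, so any strictly smaller $\rho$ leaves a polynomially vanishing bound.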


E.7 Proof of Lemma B.5 


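We first bound the difference between $\hat\theta_z^\top\widehat{\Sigma}'_z\hat\theta_z$ and $\theta_z^\top\widehat{\Sigma}'_z\theta_z$ by applying the triangle inequality and the Cauchy-Schwarz inequality

$$\begin{aligned}
\big|\hat\theta_z^\top\widehat{\Sigma}'_z\hat\theta_z - \theta_z^\top\widehat{\Sigma}'_z\theta_z\big|
&\le (\hat\theta_z - \theta_z)^\top\widehat{\Sigma}'_z(\hat\theta_z - \theta_z) + 2\big|(\hat\theta_z - \theta_z)^\top\widehat{\Sigma}'_z\theta_z\big| \\
&\le h^{-1}(\hat\theta_z - \theta_z)^\top\widehat{\Sigma}_z(\hat\theta_z - \theta_z) + 2h^{-1/2}\sqrt{(\hat\theta_z - \theta_z)^\top\widehat{\Sigma}_z(\hat\theta_z - \theta_z)}\,\sqrt{\theta_z^\top\widehat{\Sigma}'_z\theta_z}.
\end{aligned} \tag{E.29}$$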
From Lemma B.6 we have the desired upper bound in the lemma that


sup (0 Z - 9 Z ) T 'S, Z (9 Z - 9 Z ) < Cyfm [ 

z&X 


( mlos(dm) m , 9 

y^t^ + vs +mh , 


(E.30) 


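The following lemma gives us a bound on the term $\theta_z^\top\widehat{\Sigma}'_z\theta_z$.

Lemma E.2. Under Assumption (A1), for any $z \in \mathcal{X}$,

$$E\big[\widehat{\Sigma}'_z\big] = h^{-1}\lambda(K)\big[\Sigma_z + o(h)\big],$$

where $\lambda(K) = \int K^2(u)\,du$. Furthermore, with probability at least $1 - c/n$,

$$\sup_{z\in\mathcal{X}} \big\|\widehat{\Sigma}'_z - E[\widehat{\Sigma}'_z]\big\|_{\max} \le C\left(\frac{1}{nh^2} + \frac{1}{\sqrt{nh^3}} + \sqrt{\frac{\log(dm)}{nmh^3}}\right).$$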


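We defer the proof of the lemma to the end of the section. Using Lemma E.2, we have

$$\begin{aligned}
\theta_z^\top\widehat{\Sigma}'_z\theta_z
&\ge \theta_z^\top E[\widehat{\Sigma}'_z]\theta_z - \big\|\widehat{\Sigma}'_z - E[\widehat{\Sigma}'_z]\big\|_{\max}\|\theta_z\|_1^2 \\
&\ge h^{-1}\lambda(K)\,\theta_z^\top\Sigma_z\theta_z - \big\|\widehat{\Sigma}'_z - E[\widehat{\Sigma}'_z]\big\|_{\max}\|\theta_z\|_1^2 - o(1) \\
&\ge h^{-1}\lambda(K)\,e_1^\top\theta_z - Cm\left(\frac{1}{nh^2} + \frac{1}{\sqrt{nh^3}} + \sqrt{\frac{\log(dm)}{nmh^3}}\right) - o(1).
\end{aligned} \tag{E.31}$$

We can also bound from the other direction as

$$\theta_z^\top\widehat{\Sigma}'_z\theta_z \le h^{-1}\lambda(K)\,e_1^\top\theta_z + Cm\left(\frac{1}{nh^2} + \frac{1}{\sqrt{nh^3}} + \sqrt{\frac{\log(dm)}{nmh^3}}\right) + o(1). \tag{E.32}$$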


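Combining (E.29), (E.30), (E.31) and (E.32), if $mh = o(1)$ and $h \asymp n^{-\delta}$ for $\delta > 1/5$, there exists a constant $c$ such that for any $z \in \mathcal{X}$,

$$\hat\theta_z^\top\widehat{\Sigma}'_z\hat\theta_z \ge \theta_z^\top\widehat{\Sigma}'_z\theta_z - \big|\hat\theta_z^\top\widehat{\Sigma}'_z\hat\theta_z - \theta_z^\top\widehat{\Sigma}'_z\theta_z\big| = h^{-1}\big(\lambda(K)\,e_1^\top\theta_z + o(1)\big) \ge ch^{-1}e_1^\top\theta_z.$$

Similarly, we also have $\hat\theta_z^\top\widehat{\Sigma}'_z\hat\theta_z \le Ch^{-1}e_1^\top\theta_z$. The proof will be done once we prove Lemma E.2.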
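Proof of Lemma E.2. For any $j, j' \in [d]$ and $k, k' \in [m]$, we have

$$\begin{aligned}
E\big[\widehat{\Sigma}'_z\big]_{jj'kk'}
&= \int K_h^2(x_1 - z)\,\psi_{jk}(x_j)\psi_{j'k'}(x_{j'})\,p_{1,j,j'}(x_1, x_j, x_{j'})\,dx_1\,dx_j\,dx_{j'} \\
&= h^{-1}\int K^2(u)\,\psi_{jk}(x_j)\psi_{j'k'}(x_{j'})\,p_{1,j,j'}(z + uh, x_j, x_{j'})\,du\,dx_j\,dx_{j'} \\
&= h^{-1}\lambda(K)\int \psi_{jk}(x_j)\psi_{j'k'}(x_{j'})\big(p_{1,j,j'}(z, x_j, x_{j'}) + o(h)\big)\,dx_j\,dx_{j'} \\
&= h^{-1}\lambda(K)\big[\Sigma_z + o(h)\big]_{jj'kk'}.
\end{aligned}$$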
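The $h^{-1}\lambda(K)$ scaling in this calculation comes from the change of variables $x_1 = z + uh$ applied to the squared kernel. As a minimal numerical sketch, the snippet below checks the scalar analogue $E[K_h^2(X - z)] = h^{-1}\lambda(K)\,p(z)$. The Epanechnikov kernel, the uniform density on $[0, 1]$ (for which $p(z) = 1$ and the $o(h)$ term vanishes) and all function names are assumptions chosen for illustration; none of them comes from the paper.

```python
def epanechnikov(u: float) -> float:
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1] (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def midpoint_integral(f, a: float, b: float, steps: int = 20000) -> float:
    """Composite midpoint rule for the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


def lambda_K(kernel=epanechnikov) -> float:
    """lambda(K) = integral of K(u)^2 du over the kernel support [-1, 1]."""
    return midpoint_integral(lambda u: kernel(u) ** 2, -1.0, 1.0)


def second_moment_scaled_kernel(z: float, h: float, kernel=epanechnikov) -> float:
    """E[K_h(X - z)^2] for X ~ Uniform[0, 1], where K_h(x) = K(x / h) / h."""
    return midpoint_integral(lambda x: (kernel((x - z) / h) / h) ** 2, 0.0, 1.0)


if __name__ == "__main__":
    z, h = 0.5, 0.1
    print("lambda(K)          :", lambda_K())
    print("E[K_h(X - z)^2]    :", second_moment_scaled_kernel(z, h))
    print("lambda(K) / h      :", lambda_K() / h)
```

For the Epanechnikov kernel $\lambda(K) = 3/5$, so with $h = 0.1$ both printed moments should be close to 6 as long as the kernel window $[z - h, z + h]$ lies inside $[0, 1]$.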
The second part of the proof is similar to the proof of Lemma D.1. Consider the random variable

$$Z_{kk'jj'} = \sup_{z\in\mathcal{X}} (\mathbb{E}_n - \mathbb{E})\big[K_h^2(X_{i1} - z)\,\psi_{jk}(X_{ij})\psi_{j'k'}(X_{ij'})\big].$$

Let $\xi_1, \ldots, \xi_n$ be i.i.d. Rademacher variables. The symmetrization inequality (see Lemma 2.3.1 in van der Vaart and Wellner, 1996) gives us

$$E[Z_{kk'jj'}] \le 2E\left[\sup_{z\in\mathcal{X}} \frac{1}{n}\sum_{i=1}^n \xi_i K_h^2(X_{i1} - z)\,\psi_{jk}(X_{ij})\psi_{j'k'}(X_{ij'})\right].$$


Define the following two function classes 

Qh = {g z (xi,X2,x 3 ) = h~ 2 K 2 (hr 1 (x\ - z))^ k (x2)^k'(x^) z € X,xi,X2,x 3 G dfj and 
Tl = {/i,- 2 iL 2 (/i _1 (- - z)) | z G X} . 

Using Lemma [F.2 we bound the covering number by 


sup 

Q 




where Q is any measure on M. Therefore the covering number for satisfies 


N(g k ,L 2 (r),e) < (MMfeli)' 


The envelope of Q k is U = Ah 2 ||J\ ||oo and we bound the variance of the process by 

4 := E [(Kl (X x - z) (MXjWk'iXj,)) 2 

= h~ 3 E [K 2 (h~ 1 (X l - z)) E [(^(X^liXy) | Ad]] < Cm~ 2 h~ 3 . 
Using Lemma F.1, we obtain the upper bound of the expectation


I ^[Zkk'jj'] < 2E 

i —2 Jl 


sup CiK h (Xn - z)ijj k (X i j)if> k '(X ij ') 


z£X 


<C 1 


\og(C2m) 
nm 2 h 3 


As \Z kk ijji\ < Ah 2 and < Cm 2 h Lemma 


F.3 


gives us 


Zkk'jj' > E [Zkk'jj'] + t\J Cm~ 2 h~ 3 + 4L _2 E[Zfcfc/jj/] + 4f 2 L 2 /3^ < exp(— nt 2 ). 
By letting $t = 2\sqrt{\log(dm)/n}$, we obtain

= °p 4 + 


llog(dm) 
nm 2 h 3 


(E.33) 


We also define the empirical process 


Z kj = sup - Y, K h{Xi i - - E [K h (X i - • 


n 7=1 
As above, we can show that the supremum of the empirical process has the convergence rate


supmaj | KU, 1) * ES-,0, Dl =Or(^2 + V 1 ■ 

Finally, we have the following upper bound 

1 n 

sup |S' 2 (1,1) - EE' 2 (1,1)| < sup - V K 2 h {X n -z)- mliXi ~ z)\ 

— v 1 1 — v n z — J 


zGA’ 


zGA' 


2=1 


<cii + Al 


Vnh ? nh 2 J ' 


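Combining (E.33), (E.34) and (E.35), with probability at least $1 - c/n$, we have

$$\sup_{z\in\mathcal{X}} \big\|\widehat{\Sigma}'_z - E[\widehat{\Sigma}'_z]\big\|_{\max} \le C\left(\frac{1}{nh^2} + \frac{1}{\sqrt{nh^3}} + \sqrt{\frac{\log(dm)}{nmh^3}}\right),$$

which completes the proof of the lemma.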


(E.34) 


(E.35) 


Hi 


F Results on Empirical Processes 


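Lemma F.1 (Theorem 3.12, Koltchinskii, 2011). Assume that the functions in $\mathcal{F}$ defined on $\mathcal{X}$ are uniformly bounded by a constant $U$ and $F(\cdot)$ is the envelope of $\mathcal{F}$ such that $|f(x)| \le F(x)$ for all $x \in \mathcal{X}$ and $f \in \mathcal{F}$. Define $\sigma_P^2 = \sup_{f\in\mathcal{F}} E[f^2]$. Let $X_1, \ldots, X_n$ be i.i.d. copies of the random variable $X$. We denote the empirical measure as $P_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$. If for some $A, V > 0$ and for all $\epsilon > 0$ and $n \ge 1$, the covering entropy satisfies

$$N\big(\mathcal{F}, L^2(P_n), \epsilon\big) \le \left(\frac{A\|F\|_{L^2(P_n)}}{\epsilon}\right)^V,$$

then for any i.i.d. subgaussian mean zero random variables $\varepsilon_1, \ldots, \varepsilon_n$ there exists a universal constant $C$ such that

$$E\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right] \le C\left[\sqrt{\frac{V}{n}}\,\sigma_P\sqrt{\log\frac{A\|F\|_{L^2(P)}}{\sigma_P}} + \frac{VU}{n}\log\frac{A\|F\|_{L^2(P)}}{\sigma_P}\right].$$

Furthermore, we also have

$$E\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - E f(X)\big)\right] \le C\left[\sqrt{\frac{V}{n}}\,\sigma_P\sqrt{\log\frac{A\|F\|_{L^2(P)}}{\sigma_P}} + \frac{VU}{n}\log\frac{A\|F\|_{L^2(P)}}{\sigma_P}\right].$$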

Lemma F.2 (Lemma 3, Giné and Nickl, 2009). Let $K : \mathbb{R} \to \mathbb{R}$ be a bounded variation function.
Define the function class $\mathcal{F}_h = \{K((t - \cdot)/h) \mid t \in \mathbb{R}\}$. There exists $A < \infty$ such that for all
probability measures $Q$ on $\mathbb{R}$, we have

supA^(J r /l ,L 2 (Q),e) < , for any e £ (0,1). 
Lemma F.3 (Bousquet, 2002). Let $X_1, \ldots, X_n$ be independent random variables and let $\mathcal{F}$ be a function class such that there exist $\eta_n$ and $\tau_n^2$ satisfying

$$\sup_{f\in\mathcal{F}} \|f\|_\infty \le \eta_n \quad\text{and}\quad \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(f(X_i)) \le \tau_n^2.$$

Define the random variable $Z$ as the supremum of an empirical process

$$Z = \sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - E[f(X_i)]\big)\right|.$$

Then for any $z > 0$, we have the following concentration inequality on the supremum

$$P\left(Z \ge EZ + z\sqrt{2(\tau_n^2 + 2\eta_n EZ)} + 2z^2\eta_n/3\right) \le \exp(-nz^2).$$
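To make the role of this inequality in the proof of Lemma E.2 concrete, the sketch below evaluates the deviation threshold $EZ + z\sqrt{2(\tau_n^2 + 2\eta_n EZ)} + 2z^2\eta_n/3$ and the tail $\exp(-nz^2)$. It then applies a union bound over $(dm)^2$ index tuples with $z = 2\sqrt{\log(dm)/n}$, which is how that proof appears to use the lemma. The function names, the count $(dm)^2$ and the numeric inputs are illustrative assumptions rather than quantities stated in the paper.

```python
import math


def bousquet_threshold(mean_Z: float, tau_sq: float, eta: float, z: float) -> float:
    """Deviation level EZ + z*sqrt(2*(tau^2 + 2*eta*EZ)) + 2*z^2*eta/3 from Lemma F.3."""
    return mean_Z + z * math.sqrt(2.0 * (tau_sq + 2.0 * eta * mean_Z)) + 2.0 * z * z * eta / 3.0


def tail_probability(n: int, z: float) -> float:
    """Right-hand side exp(-n * z^2) of the concentration inequality."""
    return math.exp(-n * z * z)


def union_bound_over_entries(n: int, d: int, m: int) -> float:
    """Union bound over (d*m)^2 entries with z = 2*sqrt(log(d*m)/n).

    Assumption (illustration only): one supremum Z per tuple (j, j', k, k'),
    giving (d*m)^2 events, each with tail exp(-n z^2) = (d*m)^(-4).
    """
    z = 2.0 * math.sqrt(math.log(d * m) / n)
    return (d * m) ** 2 * tail_probability(n, z)


if __name__ == "__main__":
    n, d, m = 500, 50, 8
    print("threshold   :", bousquet_threshold(mean_Z=0.1, tau_sq=1.0, eta=2.0, z=0.2))
    print("union bound :", union_bound_over_entries(n, d, m))
    print("(dm)^(-2)   :", (d * m) ** -2)
```

With this choice of $z$ each tail equals $(dm)^{-4}$, so the union bound over all entries is $(dm)^{-2}$, small enough to control the maximum over entries simultaneously.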